UNCORRELATED LINEAR COMPOSITES MAXIMALLY
RELATED TO A COMPLEX OF
CORRELATED OBSERVATIONS¹


HENRY F. KAISER
University of Wisconsin 


In developing a battery of tests, it is often desired, for technical
or for conceptual reasons, to have test scores that are mutually
uncorrelated. This note solves mathematically for p linear composites
of the observations on p variables (the scores on a battery of p tests,
say), composites which are mutually uncorrelated, and each of
which is paired with the observations on one of the original variables
such that the sum of the p pair-wise correlations between the observed
variables and composites is maximized.

We consider then an n × p score matrix Z, giving the observations
of n persons on p variables, standardized so that each variable's
observations have mean zero and variance one. Let X be another
n × p score matrix consisting of scores which are linear combinations
of the original observations Z, i.e.,

$$X' = CZ', \tag{1}$$

where X' and Z' are the transposes of X and Z, and where C is a
nonsingular p × p matrix of constant coefficients. Also, let the
observations X be standardized and mutually uncorrelated. Obviously

$$Z' = AX', \tag{2}$$


¹ This problem was suggested to me by Professor Robert E. Grinder.
Computations were done on the CDC 1604 of the University of Wisconsin 
Computing Center under a grant from the University of Wisconsin Research 
Committee. Since this Center is partially supported by the National Science 
Foundation and by the Wisconsin Alumni Research Foundation, these agencies 
are acknowledged. Mr. Edgar Arendt assisted in the computations. 
where

$$A = C^{-1}. \tag{3}$$

Now, postmultiply (2) by X:

$$Z'X = AX'X, \tag{4}$$

and, upon dividing by n, the number of persons,

$$Z'X/n = A, \tag{5}$$

since X'X/n = I, as the scores X are standardized and uncorrelated;
thus, A is the cross-correlation matrix between the Z's and the X's.
In the terminology of factor analysis, A is a factor matrix, and the
columns of X are observations on uncorrelated factors (or components).
It is well known that A is not unique.

Let us choose A such that

$$\operatorname{trace}(A) = \text{maximum}, \tag{6}$$

so that each column of Z is paired with a column of X, with the sum
of the correlations over all p pairs as large as possible. Let Â be an
approximation to A, where the columns of Â represent the uncorrelated
composites, and let â_j and â_k be the jth and kth columns of
Â. Taking out these two columns, and postmultiplying them by
the orthonormal transformation

$$\begin{bmatrix} \cos\phi & -\sin\phi \\ \sin\phi & \cos\phi \end{bmatrix}, \tag{7}$$

we obtain new diagonal elements, ã_jj and ã_kk:

$$\tilde{a}_{jj} = \hat{a}_{jj}\cos\phi + \hat{a}_{jk}\sin\phi, \qquad \tilde{a}_{kk} = -\hat{a}_{kj}\sin\phi + \hat{a}_{kk}\cos\phi, \tag{8}$$

whose sum

$$s = \tilde{a}_{jj} + \tilde{a}_{kk} = (\hat{a}_{jj} + \hat{a}_{kk})\cos\phi + (\hat{a}_{jk} - \hat{a}_{kj})\sin\phi, \tag{9}$$

according to (6), is to be a maximum.

Now, the derivative of (9), with respect to φ, is

$$\frac{ds}{d\phi} = -(\hat{a}_{jj} + \hat{a}_{kk})\sin\phi + (\hat{a}_{jk} - \hat{a}_{kj})\cos\phi. \tag{10}$$

Setting this derivative equal to zero, we find

$$\tan\phi = \frac{\hat{a}_{jk} - \hat{a}_{kj}}{\hat{a}_{jj} + \hat{a}_{kk}}. \tag{11}$$
From (11) we see then that a maximum for (6) will occur only if
â_jk = â_kj for all j and k, i.e., only when Â is symmetric: the
stationary angle φ is then zero for every pair of columns, so no
further rotation can increase the trace. It is obvious that the
sufficient condition for a maximum is to choose the diagonals
of A positive.
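
The single-pair rotation in (7) through (11) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `best_rotation` and the 2 × 2 block of numbers are hypothetical, and `atan2` is used in place of a plain arctangent so that the stationary point selected is the maximum even when the sum of the two diagonals is negative.

```python
import math

def best_rotation(a_jj, a_jk, a_kj, a_kk):
    """Rotate the (j, k) block by the angle of equation (11).

    Returns phi and the four rotated elements of the 2 x 2 block.
    """
    # atan2 picks the stationary point of (9) that is a maximum.
    phi = math.atan2(a_jk - a_kj, a_jj + a_kk)
    c, s = math.cos(phi), math.sin(phi)
    new_jj = a_jj * c + a_jk * s      # first line of equation (8)
    new_kk = -a_kj * s + a_kk * c     # second line of equation (8)
    new_jk = -a_jj * s + a_jk * c     # off-diagonal, row j of rotated column k
    new_kj = a_kj * c + a_kk * s      # off-diagonal, row k of rotated column j
    return phi, new_jj, new_jk, new_kj, new_kk

# Hypothetical asymmetric block, chosen only for illustration.
phi, jj, jk, kj, kk = best_rotation(0.9, 0.4, 0.1, 0.8)
diagonal_sum_before = 0.9 + 0.8
diagonal_sum_after = jj + kk
```

After the rotation the two off-diagonal elements agree, which is the symmetry condition stated above, and the diagonal sum has risen from 1.7 to the square root of 1.7² + 0.3².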

Let R = Z'Z/n be the correlation matrix of the original observations,
with unit-length column eigenvectors given by the orthonormal
matrix Q and eigenvalues given by the diagonal matrix M²; thus
the principal axes decomposition of R is

$$R = QM^{2}Q'. \tag{12}$$

Now let

$$A = QMQ', \tag{13}$$

where M is the diagonal matrix of positive square roots of the
eigenvalues in M². A is obviously symmetric, and clearly is a factoring of
R since

$$AA' = QMQ'(QMQ')' = QMQ'QMQ' = QM^{2}Q' = R \tag{14}$$

because of the symmetry of QMQ', the orthonormality of Q, and (12).
The weights A are to be applied to the derived linear composites X
to yield the original observations Z, as in (2); to obtain the weights
C to be applied to the original observations Z to yield the desired
uncorrelated composites X, as in (1), remember (3) and the
orthonormality of Q, and find

$$C = A^{-1} = (QMQ')^{-1} = QM^{-1}Q'. \tag{15}$$


Example. Consider the observations of n = 5 persons on p = 3
tests:

[Raw score matrix (16): entries not legible in the source.]

These scores are standardized:

$$Z = \begin{bmatrix} -1.414 & -.663 & -.950 \\ .000 & -.663 & .777 \\ 1.414 & 1.548 & -1.382 \\ -.707 & .811 & .346 \\ .707 & -1.032 & 1.209 \end{bmatrix}. \tag{17}$$
The correlation matrix for the observations on the three variables is

$$R = \begin{bmatrix} 1.000 & .365 & .000 \\ .365 & 1.000 & -.598 \\ .000 & -.598 & 1.000 \end{bmatrix}, \tag{18}$$

which has unit-length column eigenvectors

$$Q = \begin{bmatrix} .368 & .854 & .368 \\ .707 & .000 & -.707 \\ -.604 & .521 & -.604 \end{bmatrix}, \tag{19}$$

associated with eigenvalues

$$M^{2} = \begin{bmatrix} 1.701 & .000 & .000 \\ .000 & 1.000 & .000 \\ .000 & .000 & .299 \end{bmatrix}. \tag{20}$$

Using (13) we find

$$A = QMQ' = \begin{bmatrix} .980 & .197 & .033 \\ .197 & .926 & -.323 \\ .033 & -.323 & .946 \end{bmatrix}, \tag{21}$$

the symmetric matrix of cross-correlations between the observations
on the original variables and the to-be-derived uncorrelated scores X
on the desired linear composites. Note again that, given uncorrelated
observations X, the sum of the diagonals of A, .980 + .926 +
.946 = 2.852, is a maximum. From (15) the coefficients C are

$$C = QM^{-1}Q' = \begin{bmatrix} 1.081 & -.277 & -.133 \\ -.277 & 1.299 & .454 \\ -.133 & .454 & 1.218 \end{bmatrix}, \tag{22}$$

which, when premultiplied by the original standard scores in (17),
yield the mutually uncorrelated standardized scores X:

$$X = ZC = \begin{bmatrix} -1.219 & -.901 & -1.270 \\ .081 & -.508 & .645 \\ 1.283 & .991 & -1.168 \\ -1.035 & 1.406 & .884 \\ .890 & -.987 & .910 \end{bmatrix}. \tag{23}$$
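
The computation in (12) through (15) can be retraced numerically. The following is a minimal sketch assuming NumPy is available; the function name `uncorrelated_composites` is hypothetical, and the second entry of the third column of Z (0.777) is an inferred value, chosen so that the standardized column sums to zero, because that entry is illegible in the source.

```python
import numpy as np

def uncorrelated_composites(Z):
    """Return (A, C, X) for standardized scores Z (n x p), following (12)-(15)."""
    n = Z.shape[0]
    R = Z.T @ Z / n                    # correlation matrix of the observations
    eigvals, Q = np.linalg.eigh(R)     # R = Q M^2 Q', equation (12)
    M = np.sqrt(eigvals)               # positive square roots of the eigenvalues
    A = (Q * M) @ Q.T                  # A = Q M Q', equation (13)
    C = (Q / M) @ Q.T                  # C = Q M^{-1} Q', equation (15)
    X = Z @ C                          # C is symmetric, so X = Z C' = Z C
    return A, C, X

# Standardized scores from (17); the 0.777 entry is inferred, not read from the source.
Z = np.array([
    [-1.414, -0.663, -0.950],
    [ 0.000, -0.663,  0.777],
    [ 1.414,  1.548, -1.382],
    [-0.707,  0.811,  0.346],
    [ 0.707, -1.032,  1.209],
])
A, C, X = uncorrelated_composites(Z)
composite_correlations = X.T @ X / Z.shape[0]   # should be the identity matrix
```

Because the scores in (17) are rounded to three places, the computed A and C agree with (21) and (22) only to about that precision.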
AN EMPIRICAL COMPARISON OF ρa AND ρm AS INDICES
OF RATER POLICY AGREEMENT


JAMES C. NAYLOR, ARTHUR L. DUDYCHA AND
E. ALLEN SCHENCK


In a recent paper by Naylor and Schenck (1966) the suggestion
was made that the policy of a judge in a typical rating situation be
defined by his least-squares regression equation. This definition is
possible whenever the judge bases his ratings of a series of n stimuli
on k known and quantifiable characteristics or traits (Xi). One
of their main reasons for approaching judgment policies in this
manner was to introduce a new method of expressing agreement or
similarity between different judges' respective policies.

Perhaps the most traditional method for grouping raters has been
on the basis of the agreement in their actual judgments, usually
measured by a product-moment correlation. Thus, given p raters,
one simply calculates the p(p − 1)/2 "rater agreement" correlations
(ρa) between each pair of raters. These correlations form a matrix,
Ra, which can be factor analyzed and the judges, consequently,
grouped according to similar patterns in their factor loadings.

However, Naylor and Schenck (1966) have suggested the alternative
approach of (a) calculating a least-squares regression equation
(Y′ = a + b1X1 + b2X2 + ... + bkXk) for each of the raters,
(b) computing the predicted value for each stimulus for each of the
raters and then (c) computing the correlations (ρm) between their
predicted judgments. It can be argued that such "policy agreement"
correlations based upon predicted values are better indices of the
similarity or "matching" of two raters' policies and, therefore, are
better suited for grouping raters in the above-mentioned factor-analytic
manner. Assuming zero partial correlations of residuals,
Naylor and Schenck employed some of the results of Brunswikian
multiple cue regression analysis (see Hursch, Hammond, and
Hursch, 1964 for an introduction into this area of research) to
show that the "rater agreement" between any two judges (A and
B) is a function of their "policy agreement" and their respective
linear consistencies. That is,

$$\rho_{a(A,B)} = \rho_{m(A,B)}\,\rho_A\,\rho_B, \tag{1}$$

where ρA and ρB are the multiple correlations between the traits
(Xi) and the ratings of judges A and B, respectively (intra-judge
consistency).
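
Steps (a) through (c) and the relation in Equation 1 can be illustrated on simulated ratings. This is a sketch under stated assumptions, not the authors' program: it assumes NumPy, and the cue weights, noise levels, seed, and the helper name `policy_stats` are all hypothetical. It also computes the correlation between the two raters' residuals, since in any one sample ρa equals the product in Equation 1 plus a residual term that vanishes only when that correlation is zero.

```python
import numpy as np

def policy_stats(X, y):
    """Least-squares policy for one rater: predicted ratings and multiple R."""
    design = np.column_stack([np.ones(len(y)), X])   # intercept a plus weights b
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    y_hat = design @ coef
    multiple_r = np.corrcoef(y, y_hat)[0, 1]
    return y_hat, multiple_r

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))                        # hypothetical: 200 profiles, 6 cues
y_a = X @ np.array([0.5, 0.3, 0.1, 0.0, 0.0, 0.2]) + rng.normal(scale=0.5, size=200)
y_b = X @ np.array([0.4, 0.4, 0.0, 0.1, 0.0, 0.1]) + rng.normal(scale=0.8, size=200)

yhat_a, consistency_a = policy_stats(X, y_a)         # step (a) and (b) for rater A
yhat_b, consistency_b = policy_stats(X, y_b)         # step (a) and (b) for rater B
rho_a = np.corrcoef(y_a, y_b)[0, 1]                  # rater agreement
rho_m = np.corrcoef(yhat_a, yhat_b)[0, 1]            # policy agreement, step (c)
residual_r = np.corrcoef(y_a - yhat_a, y_b - yhat_b)[0, 1]

# Equation 1 plus the residual term that the zero-correlation assumption drops.
reconstructed = (rho_m * consistency_a * consistency_b
                 + residual_r * np.sqrt((1 - consistency_a**2) * (1 - consistency_b**2)))
```

Here `reconstructed` matches `rho_a` to floating-point precision, and the gap between `rho_a` and `rho_m * consistency_a * consistency_b` shows how far the sample departs from the zero-residual-correlation assumption.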

It is clearly evident that ρa will always be less than ρm unless
both raters are perfectly consistent. Indeed, if both raters formed
their respective policies and consistently used them in making their
judgments, rater agreement would be policy agreement. The above
equation simply provides an algebraic expression of this fact. But,
since judges are seldom so consistent in reality, it would seem quite
desirable to group them on the basis of an index of agreement that
would be unencumbered by such inconsistencies, namely ρm.

The present experiment was designed to see if the use of ρm in
grouping raters by means of a factor-analytic technique would
result in "cleaner" and higher factor loadings than the use
of ρa. The actual stimuli and rater judgments necessary for the
regression analysis and the calculation of the two correlation matrices,
Ra and Rm, were taken from an earlier study dealing with
certain problems in using artificial stimuli for judgment and attitude
research (Dudycha and Naylor, 1966).


Method 


The authors chose the typical attitude assessment or rating situation,
where a group of judges is asked to evaluate the same set
of n stimuli (in this case, job profiles) on some defined criterion
dimension (in this instance, "job desirability") and where each
stimulus has (job) dimensions which are potentially relevant to
the judgment required of the raters. Accordingly, the relative merits
of ρa (the agreement between what the judges actually do)
could be compared with those of ρm (the agreement of their basic
judgmental policies). Further, three different sets of stimuli were
used to systematically vary the characteristics of the R matrix
describing the intercorrelations between the stimulus dimensions, in
order to examine the effect of variations in the underlying cue R
matrix upon the ρa and ρm policy clustering comparisons.


Profile Generation and Experimental Stimuli 


Three different sets of stimuli were generated from factor
matrices derived from Wherry's (1954) factor analytic study of the
SRA Attitude Survey. One F matrix (F_TRUE) was considered as
being representative of the "real world" interrelationships of the
job dimensions, while the other two were distortions of F_TRUE: one
yielded orthogonality between the job dimensions (F_ORTH), while
the other produced stimuli having high intercorrelations between
the job dimensions (F_HIGH). Thus, three distinct theoretical F
matrices and their associated theoretical R matrices were defined,
describing three theoretical sets of stimuli. Two hundred "job profiles"
were then generated from each F matrix by the Ohio State
Correlated Score Generation Method (Wherry, Naylor, Wherry,
and Fallis, 1965) on the IBM 1620.¹

The stimuli were accordingly designated as hypothetical "job
profiles" where each stimulus represented an unspecified job which
had been given a score from 1 (very poor) to 9 (very good) on
each of six of the fourteen SRA Attitude Survey job dimensions.


Subjects 


One hundred and fourteen male introductory psychology students
at The Ohio State University, divided into three equal
groups of 38 members each based on their availability for testing,
served as raters. In doing so, they partially fulfilled their course
requirement of experimental participation. Each group of 38 rated
only one of the three sets of 200 "job profiles" generated from the
three F matrices.


Procedure 


Each rater in the three groups was given a set of directions
describing the experiment and his task as a judge, a list of the
six job dimensions with their descriptors, and an answer sheet.
The raters were instructed that each "job profile," individually
displayed on a screen with the identifying job dimensions, was
representative of an unspecified job. They were told to examine each
profile and give it a rating from 1 (very undesirable) to 9 (very
desirable) relative to the criterion "job desirability" and record
their response (Y) in the appropriate blank on their answer sheet.

¹ For a more comprehensive discussion of the three F matrices, profile
generation, and the experimental procedure see Dudycha and Naylor (1966).


Results 


Data Analysis 
After all raters in the three groups had been tested, a multiple
regression analysis on the IBM 7094 was performed on the 200 Y
values for each rater, treating the six job dimensions (X1, X2,
..., X6) as predictors and the rater's response (Y) as the
criterion. Then 200 raw score Y′ (estimated or predicted) values
were obtained for each rater based on his terminal regression equation.
Inverse factor analyses were performed: first, using intercorrelations
(ρa) among the raters based upon each rater's actual responses
(Y values) and second, using intercorrelations (ρm) among
the raters based upon each rater's predicted responses (Y′ values).
This was done on the ρa and ρm intercorrelation matrices obtained
for the total sample, i.e., across the three groups of raters combined
(114 × 114), and for each group independently (38 × 38). Hence,
eight separate factor analyses were carried out using the Hotelling
principal axis method with a Kaiser Varimax rotation on the
IBM 7094.


Comparison of ρa and ρm


The ρa and ρm rotated factor loadings for the three groups, F_TRUE,
F_ORTH, and F_HIGH, are tabulated in Tables 1-3 respectively, while
Table 4 presents the rotated loadings for the combined groups.
From Tables 1-3 it is quite evident that the ρm intercorrelation
matrices (those based on the raters' predicted values) yielded more
clearly defined and pronounced factors than those with ρa (based
on the raters' actual responses). That is, the rater's ρm factor
strategy as compared with his ρa strategy, regardless of his group
membership, contained fewer contributing factors but at the same
time accounted for an appreciably greater amount of his variation,
thus greatly facilitating any ensuing attempt to cluster raters relative
to their basic strategies. In the combined group factor analyses
TABLE 1

ρa and ρm Rotated Factor Loadings for the F_TRUE Group

Rater     Factor 1    Factor 2    Factor 3
Number    ρa   ρm     ρa   ρm     ρa   ρm
37        71   74     34   45     43   47
38        80   83     36   42     28   34

Factor 4
ρa ρm

27 33 
24 0 
15 10 
19 05 
21 00 
15 01 
17 10 
15 -11 
21 05 
13 03 
15 02 
17 -02 
13 -1 
23 -1 
15 06 
14 05 
18 02 
19 —01 
14 —02 
18 09 
14 —01 
17 0 
85 12 
24 16 
21 04 
17 il 
15 00 
19 06 
14 04 
20 11 
18 —05 
19 —01 
17 00 
20 05 
20 —08 
19 07 
16 00 
10 03 


39 
14 
15 


09 
01 


11 


Note.—In this table and the four subsequent tables the decimal points have been omitted. 


three similar pronounced "group" factors (Table 4) were obtained
for both ρa and ρm, as was expected. However, in this case, as was
true with the groups factor analyzed independently, ρm yielded
clearer factors.
TABLE 2
ρa and ρm Rotated Factor Loadings for the F_ORTH Group

                             Factors
Rater       1       2       3       4       5       6
Number   ρa ρm   ρa ρm   ρa ρm   ρa ρm   ρa ρm   ρa ρm

1 37 63 21 37 68 57 17 32 20 —14 01 —04 
2 33 38 40 50 38 51 25 56 26 08 26 —07 
3 28 35 56 40 30 45 1 70 37 06 14 09 
4 48 61 13 57 32 31 20 42 30 05 18 00 
5 39 53 34 43 52 57 14 42 45 05 10 —01 
6 28 38 22 61 43 53 26 29 20 13 62 27 
7 63 70 17 37 36 42 09 41 30 02 23 09 
8 33 44 81 31 51 71 03 43 21 02 28 04 
9 45 55 21 54 30 31 26 52 39 —01 00 —12 
10 50 60 30 35 52 54 22 44 27 —02 12 02 
11 41 46 24 52 22 26 28 64 37 —09 16 09 
12 18 25 43 51 42 49 14 63 37 —10 16 —02 
13 33 56 09 47 79 62 27 13 21 —21 13 —02 
14 57 00 37 38 32 35 23 58 31 07 17 00 
15 28 36 48 49 29 46 11 62 41 08 20 01 
16 53 61 34 28 28 40 14 58 10 16 13 —01 
17 24 28 23 59 24 33 07 65 80 —13 05 00. 
18 48 603 20 39 62 54 25 34 260 —15 06 —02 
19 20 35 36 51 59 64 14 42 38 —03 15 —09 
20 39 40 63 37 24 33 27 76 36 01 09 00 
21 75 85 23 18 43 35 03 29 14 10 19 04 
22 58 58 49 23 34 42 06 64 28 04 09 —05 
23 45 56 41 31 51 55 20 20-06 0 —04 
24 19 21 22 75 13 29 25 41 60 35 48 04 
25 36 41 38 63 32 36 30 53 48 —06 27 00. 
26 40 44 50 40 36 57 14 51 27 20 34 06 
27 42 57 97 43 51 51 23 46 33 —02 13 05 
28 54 56 49 26 25 36 14 69 25 00 00 05 
29 23 32 45 19 63 85 03 32 08 10 33 07 
30 27 45 31 47 63 68 20 31 21 —06 25 00 
31 56 57 49 34 24 35 21 63 23 10 18 07 
32 61 76 31 23 50 47 15 36 07 04 18 03 
33 47 60 30 44 43 47 29 45 21 01 12 04 
34 65 76 16 37 54 41 17 31 20 —06 07 —02
35 69 64 24 36 10 11 13 54 31 00 16 00. 
36 16 31 14 85 24 29 87 28 15 —08 15 -04 
37 55 04 39 33 43 48 20 48 19 00 12 —05 
38 34 43 58 24 51 68 07 51 17 10 21 —03 


A more vivid comparison of ρa and ρm can be obtained by examining
the record of accountable variance (see Table 5 and Figures
1-2). From Table 5 it is readily apparent that ρm, in each case, did
indeed account for a greater portion of the variance than did ρa.
This fact is also illustrated in Figures 1 and 2, where the cumulative
accountable variance across factors is graphically depicted.
TABLE 3
ρa and ρm Rotated Factor Loadings for the F_HIGH Group

                             Factors
Rater       1       2       3       4       5       6
Number   ρa ρm   ρa ρm   ρa ρm   ρa ρm   ρa ρm   ρa ρm
35 55 65 73 54 36 23 —06 —09 —06 04 03 
56 51 48 58 44 62 30 05 08 00 14 -03 
51 68 38 47 63 53 26 12 15 06 -01 -02 
61 56 44 53 47 62 29 -01 —01 00 00 
43 38 69 72 35 52 26 05 15 23 00 
54 38 63 67 33 62 21 11 07 01 01 
50 51 59 64 47 55 22 08 07 02 05 
27 47 32 69 28 53 84 04 03 04 00 
40 64 53 63 60 42 23 06 00 —03 -03 
54 63 45 53 58 55 22 05 08 06 04 
46 37 69 72 30 56 22 04 -12 18 = 01 
35 57 47 63 49 40 28 22 04 01 00 
45 74 38 47 72 45 20 —07 04 03 07 
77 60 22 23 49 75 20 00 04 01 00 
53 07 40 47 61 54 22 07 10 03 -05 
71 49 44 48 40 71 19 —01 -05 00 01 


73 43 45 48 33 75 19 05
67 40 52 57 32 71 23 —04 -08 = 
63 58 39 46 50 64 21 14 12 

38 33 56 59 69 54 40 23 18 и 
One can also note in Figures 1 and 2 the considerable difference
in the rate of increase of the accountable variance of ρm over ρa.
Further, comparing ρm with ρa (Figure 1) for the three groups
individually, it can be seen that ρm HIGH and ρm TRUE each have three
interpretable factors while ρm ORTH has four such factors accounting
for the major portion of the variance (above 97 per cent) with rater
TABLE 4
ρa and ρm Rotated Factor Loadings for the Total Group

                        Factors
Rater       1       2       3       4       5
Number   ρa ρm   ρa ρm   ρa ρm   ρa ρm   ρa ρm
1 00 00 85 91 00 02 -06 -01 16 -28
2 03 02 92 97 03 03 -01 —01 12 —15 
3 03 03 86 97 06 07 -—03 -03 —06 03 
4 03 03 95 99 06 05 00 —01 -—02 00 
5 03 02 91 98 03 03 06 03 -—04 —03 
6 03 03 89 98 03 05 -03 -01 00 00 
7 04 02 93 98 01 03 00 01 06 —05 
8 03 02 93 98 0 02 05 02 03 —05 
9 05 04 88 96 09 07 -03 -03 —13 16 


10 00 03 90 95 10 08 07 00 -14 24 
11 03 02 85 98 02 04 00 01 -06 00 
12 03 03 89 95 04 07 09 02 -21 23 
13 04 02 88 98 0 02 02 02 15 —04 
14 0 OL 88 96 00 00 00 04 14 —19 
15 06 02 85 86 03 10 00 —05 —30 38 
16 0 00 83 88 -01 00 01 07 12 —15 
17 -0 0 90 97 05 03 -0 02 01 —04 
18 06 02 80 91 03 03 -01 —03 23 —14 
19 4 03 90 95 05 06 03 01 -19 23 
20 00 03 90 98 07 06 (ДА 0 -13 п 
21 o 01 83 90 06 02 04 05 02 —11 
22 05 03 91 98 05 05 03 —01 08 —02 
23 06 00 66 92 07 01 -13 -01 23 —32 
24 0 0 90 97 -01 03 -—13 00 13 —14 
25 02 03 93 99 06 05 -001 00 -—12 09 
26 05 04 90 96 04 08 —03 -03 -12 18 
27 00 02 85 95 10 05 09 04 -21 09 
28 -00 0 91 98 01 03 00 02 07 —10 
29 0 03 91 97 04 05 05 —02 00 02 
30 05 02 91 98 04 04 -03 02 04 —08 
31 04 03 92 97 05 04 01 —01 07 —01 
32 -00 o 86 93 —01 01 —05 -01 32 —26 
33 -0 02 91 99 05 03 06 00 04 00 
34 -00 02 92 97 05 03 —03 —01 12 —15 
35 05 03 92 97 05 04 04 00 —05 09 
36 оз 02 93 99 03 04 02 00 06 —02 
37 03 03 91 96 05 06 -03 00 -—16 20 
38 08 04 87 92 06 09 00 -03 —19 32 
39  -02 —03 03 07 83 95 -—16 —14 -01 05 
40 04 o 04 05 88 97 11 4:15 06 —02 
41 00 00 06 06 82 95 25 13 -—15 —06 
42 00 01 06 07 82 96 00 04 05 01 
43 00 00 04 05 87 99 02 —02 —04 00 
44 00 00 01 02 79 90 07 18 15 —05 
45 03 00 03 08 83 97 -—15 —17 -04 00 
46  —01 —01 02 04 84 95 —04 —05 05 —03 
47 001 00 06 08 84 96 07 08 03 03 
48  —03 00 04 06 91 98 —12 -11 04 00 
TABLE 4 (Continued)

оз 97 04 03 0 01 0з 03 -0 02 m » 
88 оз 05 05 00 00 -0 м -1 01 05 o 
94 98 03 0 04 02 -2 02 па 0 
TABLE 4 (Continued)
Factors
Rater
Number ρa ρm ρa ρm ρa ρm ρa ρm ρa ρm ρa

98 94 98 03 05 00 00 01 02 00 03 00. 
99 93 99 04 01 -02 00 01 00 04 00 -02 
100 94 99 01 03 00 01 04 01 00 02 01
101 92 98 01 02 —01 00 00 —01 05 —02 —01 
102 90 98 00 01 01 00 -—02 -03 -—04 —02 00. 
103 93 98  —01 00 03 00 —02 —01. 00 —02 00 
104 93 98 02 04 04 02 01 02 02 03 02 
105 90 97 10 03 06 01 05 03  —08 02 —03 
106 95 99 01 03 04 00 02 01 —02 00 —03 
107 90 97 01 01 —01 00 -—02 —05 —03 -01 -03 
108 95 99 —04 02 00 01 —01 00 01 00 01 
109 91 97 02 02 00 —01 —02 -03 —03 00 01 
110 92 97 00 00 —03 00 01 —01 00 —04 00 
111 92 96 07 02 00 01 04 05 09 02 03 
112 91 96 07 03 00 02 05 06 07 03 00 
113 94 98 01 01 00 00 00 00 06 00 —03 
114 93 97 02 03 —01 00 00 —02 -01 -02 -02 


Note.—Raters 1-38 comprise the F_TRUE group, Raters 39-76 comprise the F_ORTH group, and Raters 77-114
comprise the F_HIGH group.


unreliability removed, thus indicating that Factor 4 of ρa HIGH,
Factors 4 and 5 of ρa TRUE, and Factors 5 and 6 of ρa ORTH are in all
probability error factors based upon the raters' error variance about
their respective regression planes.


TABLE 5
Cumulative Accountable Variance Across Rotated Factors

              F_TRUE         F_ORTH         F_HIGH         Total
Factor        ρa     ρm      ρa     ρm      ρa     ρm      ρa     ρm
1   a        2952   3437    2012   2869    2719   3106    2852   3189
    b        2952   3437    2012   2869    2719   3106    2852   3189
2   a        1974   3379    1322   2014    2666   3569    2669   3110
    b        4926   6816    3334   4883    5385   6675    5521   6299
3   a        2030   2975    1938   2391    2538   3180    2343   2986
    b        6956   9791    5272   7274    7923   9855    7864   9285
4   a        0524   0079    0573   2514    0740   0085    0120   0120
    b        7480   9870    5845   9788    8663   9940    7984   9405
5   a        0905   0065    1042   0118    0105   0038    0111   0097
    b        8385   9935    6887   9906    8768   9978    8095   9502
6   a        0092   0022    0465   0047    0094   0016    0107   0097
    b        8477   9957    7352   9953    8862   9994    8202   9599

a Accountable variance of the rotated factors.
b Accountable variance computed cumulatively across factors.
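
The cumulative rows of Table 5 are running sums of the per-factor rows, which can be checked with a short Python sketch. The two lists below are the F_TRUE entries of Table 5 with the omitted decimal points restored; the variable names are illustrative only.

```python
from itertools import accumulate

# Per-factor accountable variance for the F_TRUE group (Table 5, rows "a").
var_rho_a = [0.2952, 0.1974, 0.2030, 0.0524, 0.0905, 0.0092]
var_rho_m = [0.3437, 0.3379, 0.2975, 0.0079, 0.0065, 0.0022]

# Running totals across factors (Table 5, rows "b"), rounded to four places.
cum_rho_a = [round(v, 4) for v in accumulate(var_rho_a)]
cum_rho_m = [round(v, 4) for v in accumulate(var_rho_m)]

# Number of factors needed before the cumulative proportion passes .97.
factors_needed_m = next(i + 1 for i, v in enumerate(cum_rho_m) if v > 0.97)
```

For this group ρm passes 97 per cent at the third factor, while ρa never reaches that level within six factors.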


[Figure 1 plot: cumulative accountable variance (vertical axis) against factors (horizontal axis), with separate curves for ρm and ρa in the TRUE, ORTH, and HIGH groups.]

Figure 1. ρa and ρm cumulative accountable variance across factors for the
three groups independently.
Figure 2. ρa and ρm cumulative accountable variance across factors for the
three groups combined.
Discussion

The use of ρm as a basic similarity index between the rating
strategies of judges appears to result in substantially "cleaner" and
more interpretable rater groupings when inverse factor analysis is
employed as a clustering technique than does the more conventional
rater agreement index ρa. Such is certainly not surprising, since, as
was pointed out earlier in Equation 1, the ρa indices are equivalent
to multiplying ρm by the respective intrarater reliabilities. Indeed,
as Naylor and Schenck (1966) point out in a previous article, ρm
may be considered to be ρa corrected for attenuation due to the
unreliability within each rater's own policy or strategy.

The use of ρa tends to result in (a) less clearly defined factors
and (b) a somewhat larger number of factors during strategy
groupings. The first of these problems is probably the less severe,
in that the groups may still be easily distinguishable. This was
the case in the present study: the use of ρm seemed to merely
"sharpen" the three or four primary factors which had also emerged
fairly distinctly with ρa. We suspect, however, that the degree of
additional clarity in the primary factors obtained by using ρm
should be even greater in cases where the Ra matrix consists
primarily of low ρa values. In the present study most ρa values were
above .70 to begin with, and thus correcting them for unreliability
did not leave a great deal of room for improvement; yet even so,
marked increases in clarity of factors were obtained.

The second problem, that of getting more factors from the Ra
matrix than the Rm matrix, seems even more critical. One of the
most persistent problems in using factor analysis as a rater clustering
technique is trying to decide where the "real" factors end and
the error factors begin. One recent approach to this problem has
been to include variables in the R matrix made up of random normal
deviates and then to inspect the size of the loadings of these
"variables" upon the various factors relative to the loadings of the
"real" variables in order to determine which factors are true factors
and which are simply error factors (Cliff, 1965; Linn, 1964; Horn,
1965). The use of ρm indices would seem to avoid this problem
entirely, since the only variance present in the Rm matrix is systematic
linear variance. Thus the problem of error factors occurring
in the F matrix becomes trivial. In summary, there seems to be
little justification for the use of ρa as opposed to ρm as the measure
of policy agreement of raters whenever it is possible to describe the
rater's strategy through the use of a multiple regression equation.

EDUCATIONAL AND PSYCHOLOGICAL MEASUREMENT
A MUTUAL DEVELOPMENT OF SOME TYPOLOGICAL
THEORIES AND PATTERN-ANALYTIC METHODS¹


LOUIS L. McQUITTY
Michigan State University


SCIENCE often advances by a theory concerning some aspects of
reality. The theory grows out of observation and thought. The
observation may be either relatively crude or highly controlled,
depending usually on the sophistication of the subject matter involved.
The observation may be made by the theoretician himself
or by study of observations pursued and reported by others.

Hypotheses are developed as logical deductions from the theory.
If the theory is correct, then the hypotheses must also be correct.
A more desirable relationship is one in which the theory is correct
if the hypothesis is correct. Few if any hypotheses hold this
relationship to a theory in any absolute sense.

The dependability of the relationship between hypotheses and
theory is usually a matter of degree. In most instances, a set of
hypotheses is more dependable than a single hypothesis, and n
carefully developed hypotheses are usually more dependable than
any subset of them.

Hypotheses have a scientific value which theories do not usually
possess: their validity can be directly tested empirically.

Science builds theory by testing hypotheses. Hypotheses are
used to help generate research designs which test the hypotheses.
If a hypothesis is investigated and substantiated, it gives support
to a theory and the investigator develops other, hopefully more
crucial, hypotheses which can in turn be tested. If, on the other
hand, a hypothesis is not substantiated, the investigator usually
revises his theory and develops new hypotheses for testing.

¹ Presidential Address, Division V, Evaluation and Measurement, The American
Psychological Association, September, 1965, Chicago, Illinois.
Approach


The above description outlines one way in which science builds
theory. The pattern-analytic methods of this paper have been
developed in a similar fashion.

A theory about the nature of individual differences in personality
structure is assumed. The assumed structure is used to generate
statistical definitions of individual differences, and the definitions
are used in turn to generate statistical methods for isolating and
describing individual differences. The statistical methods are applied
to data. The findings either substantiate or fail to substantiate
the definitions of individual differences and, therefore, the assumed
structure. The theory of personality is either extended or
revised and is investigated further.

A Theory of Types 

My pattern-analytic methods were developed in relation to a 
theory of types. 

Relation to Mental Health Status. My interest in types and
typologies developed from a series of studies initiated thirty years
ago and designed to develop methods of value in the objective
assessment of psychological well-being. I suggested that an individual
is mentally healthy in any particular psychological area of
self-description in which he conforms to a pattern characteristic
of a significant number of persons who are living normal lives.
Conversely, he is mentally ill in any particular psychological area
of self-description in which he conforms to a pattern uniquely
characteristic of a group of patients hospitalized for mental illness.
From this point of view an individual can be mentally healthy in
some psychological areas and mentally ill in others; he is a mentally
healthy type in some psychological areas and he is a mentally
ill type in other psychological areas (McQuitty, 1954a).

Assumptions and Definitions. A type is a member of a category 
within some kind of classification system. This person and other 
members of his type have a combination of attributes which is 
uniquely characteristic of them; furthermore, the attributes of 
these persons are interrelated in a unique fashion. This set of 
attributes is not necessarily representative of all attributes; a per- 
son may belong to one type in one universe of attributes and to 
quite another type in another universe of attributes. 
In a statistical sense, a type is a category of persons wherein
everyone in the category is more like every other person in the
category than he is like any person in any other category.

This definition is comprehensive in the statistical operations it 
requires; in order to isolate a type every person must be com- 
pared with every other person. 

A restricted definition can be derived as a logical consequence 
of the comprehensive one. The restricted definition says that a 
type is a category of persons wherein everyone in the category is 
most like some other person in the category; everyone is classified 
with the one person most like himself. 

If types do in fact exist exactly as required by the comprehen- 
sive definition, and if everyone belongs to a type, then both defini- 
tions yield identical results. This statement assumes, of course, 
that we have valid indices of likeness between people. 
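
The comprehensive definition can be expressed as a direct check on a proposed classification. The sketch below is an illustration only: the function name, the five-person similarity matrix, and the two candidate groupings are hypothetical and are not taken from the data discussed in this paper.

```python
def satisfies_comprehensive(types, sim):
    """True if every person is more like each fellow member than like any outsider.

    types: list of lists of person indices; sim: symmetric similarity matrix.
    """
    for members in types:
        outsiders = [p for other in types if other is not members for p in other]
        for a in members:
            for b in members:
                if a != b and any(sim[a][b] <= sim[a][c] for c in outsiders):
                    return False
    return True

# Hypothetical indices of likeness among five persons (0..4).
sim = [
    [1.0, 0.8, 0.6, 0.1, 0.2],
    [0.8, 1.0, 0.5, 0.2, 0.1],
    [0.6, 0.5, 1.0, 0.3, 0.2],
    [0.1, 0.2, 0.3, 1.0, 0.7],
    [0.2, 0.1, 0.2, 0.7, 1.0],
]
good_grouping = satisfies_comprehensive([[0, 1, 2], [3, 4]], sim)
bad_grouping = satisfies_comprehensive([[0, 1], [2, 3, 4]], sim)
```

The check compares every person with every other person, which is the sense in which the definition is comprehensive in the statistical operations it requires.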

Each of the two definitions, the comprehensive and the re- 
stricted, can be used to generate statistical methods for the isola- 
tion of statistical types. Application of the two methods to data 
assisted in the development of improved definitions of types, and 
these definitions were then applied to select the test items most ef- 
fective in depicting the types. 


Linkage Analysis 


Linkage Analysis (McQuitty, 1957; McQuitty, 1961) was de- 
veloped from the restricted definition of types. Every person is 
classified with the one person most like himself. Linkage Analysis 
was applied to the data shown in Table 1, a matrix of intercorrela- 
tions between selected people based on data especially prepared by 
Stephenson (1953) for isolating types. 

The first step in the analysis is to underline the highest entry in 
each column of the matrix of Table 1. For example, 577 is under- 
lined in Column A because it is the highest entry in this column. 

The next step is to select the highest entry of the matrix. It is
615 and mediates between Persons C and D. Since 615 is the highest
entry in the matrix, it is necessarily reciprocal: C is highest
with D and D is in turn highest with C.

Persons C and D form the beginning of a statistical type and
are so indicated in Type 1, Figure 1, by their designation as a
reciprocal pair.
[Figure 1: linkage diagrams of the two types]


Figure 1. The Types
=  Means a reciprocal pair of variables; C is most like D, and D is
   most like C.
→  Means that the variable at the head of the arrow is highest with
   the one at the tail, but the one at the tail is not highest with
   the one at the head.


Since everyone is classified with the one person most like himself,
we must now search the matrix to determine if any person has
either C or D most like himself. For Person D, for example, we
merely read across Row D of Table 1, and we find that Person D
is underlined as highest in his relation to Persons A, C, F, G, and H,
respectively. Because all of these persons have D most like them,
they must be classified with D as shown in Type 1, Figure 1.

We then examine the rows of each of these persons, A, C, F, G,
and H. Only two of them bring in another person: F brings in E
and C brings in B. The rows for Persons B and E are examined.
They bring in no additional persons. The first type is thus
completed. Subsequent types are isolated in the same fashion, using
the reduced matrices.
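
The steps of Linkage Analysis described above can be sketched in code. The following is a minimal illustration and not McQuitty's own program. Because Table 1 is not reproduced here, it assumes as input the agreement scores between eight companies that this paper reports later in Table 6; the names `linkage_analysis`, `LABELS`, and `SCORES` are hypothetical. Each person is linked to the one person most like himself, and the linked clusters are taken as the types.

```python
# Agreement scores between eight companies (the paper's Table 6).
# The diagonal is not reported in the source; 0 is a placeholder that is never read.
LABELS = ["A", "B", "C", "D", "E", "F", "G", "H"]
SCORES = [
    [0, 29, 16, 16, 14, 6, 11, 7],
    [29, 0, 17, 17, 13, 6, 8, 10],
    [16, 17, 0, 26, 10, 8, 9, 13],
    [16, 17, 26, 0, 10, 12, 11, 11],
    [14, 13, 10, 10, 0, 21, 17, 13],
    [6, 6, 8, 12, 21, 0, 19, 17],
    [11, 8, 9, 11, 17, 19, 0, 24],
    [7, 10, 13, 11, 13, 17, 24, 0],
]


def linkage_analysis(labels, scores):
    """Return (types, reciprocal_pairs) for a symmetric association matrix."""
    n = len(labels)

    # Step 1: "underline" the highest off-diagonal entry in each column.
    # Ties are broken by taking the first maximum (an assumption of this sketch).
    best = []
    for j in range(n):
        rows = [i for i in range(n) if i != j]
        best.append(max(rows, key=lambda i: scores[i][j]))

    # Step 2: a pair is reciprocal when each member is highest with the other.
    pairs = sorted(
        {tuple(sorted((labels[j], labels[best[j]])))
         for j in range(n) if best[best[j]] == j}
    )

    # Step 3: classify every person with the one person most like himself,
    # i.e. take the connected clusters of the "most like" links (union-find).
    parent = list(range(n))

    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]
            a = parent[a]
        return a

    for j, i in enumerate(best):
        parent[find(j)] = find(i)

    groups = {}
    for j in range(n):
        groups.setdefault(find(j), []).append(labels[j])
    return sorted(groups.values()), pairs


TYPES, RECIPROCAL_PAIRS = linkage_analysis(LABELS, SCORES)
```

On these data every company is highest with its partner, so the sketch returns four types, each consisting of a single reciprocal pair.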


Rank Order Typal Analysis 


The above method of analysis was derived from a restricted 
definition of types. I will now review a method, Rank Order Typal 
Analysis (McQuitty, 1963; McQuitty, 1965), which was developed 
out of my comprehensive definition of types. The methods will 
then be compared and considered jointly with respect to the im- 
pact of data on them. 

In the comprehensive definition of types, every person in a cate- 
gory must be more like every other person in that category than he 
is like any person in any other category, not just most like someone 
in his category, as in the restricted definition. 
“Symbolic representation of two categories of people, one
fulfilling and the other one failing to fulfill the above definition
of a type, is shown in Figures 2 and 3, respectively.


Figure 3. A Non-Typal Category of People 

“In the case of Figure 2, Person a has a most like himself, b
second most like him and c third most like him. A similar condi- 
tion holds for each Person b and c. Consequently, each person in 
the category is more like everyone else in the category than he is 
like anyone not in the category; there is no rank in the category 
higher than the number of people in the category. When this latter 
condition is satisfied the definition of a type is fulfilled. 

“Figure 3 reflects an exception to the above kind of internal
consistency; Person x has z fourth most like himself, y third most
like him, and he is first most like himself. This means that there is
some other person in some other category who is second most like
x. As a consequence, Figure 3 contains one rank which is higher
than the number of persons and does not therefore qualify as a
type; whenever a category contains one or more persons with a
rank higher than the number of persons in the category, it fails to
qualify as a type.


Throughout this paper, quotations are included which refer to Tables
and Figures; in order to make the code numbers correspond to the
order in which tables and figures appear in this paper, they have
been changed within the quotation.
“It is a simple matter to examine a matrix of interassociations
between people in such a fashion as to isolate all of the categories
which qualify as types. The task is accomplished by examining
the matrix serially one column at a time.

“The first step is to take a matrix such as the one shown in
Table 2 and convert it into a matrix of ranks as reported in Table
3 to show, for example in Column x, that the person most like x is
x himself and then w, y, and z in that order.


TABLE 2
Hypothetical Associations
between People

         w    x    y    z
    w   71   60   65   35
    x   60   72   50   40
    y   65   50   73   45
    z   35   40   45   74


TABLE 3
Rank Order within the Columns
from Table 2

         w    x    y    z
    w    1    2    2    4
    x    3    1    3    3
    y    2    3    1    2
    z    4    4    4    1


“If x forms a type with any one person of Table 3,” he must
form it with “Person w, who is second most like him (x being most
like himself). In order to examine this possibility, we form a
submatrix of x with w, Table 4, using the ranks from Table 3. Person
x does not form a type with Person w; the submatrix contains a
rank larger than the number of persons in the submatrix. Since
Person x does not form a type of two persons with the one person
second most like x (x being most like himself), Person x does not
form a type of two persons with any other one person of Table 3.

“If x forms a type with any two persons of Table 3,” he must
form it “with the persons second and third most like x. These are
w and y. A submatrix of w, x, and y is formed, Table 5. It contains
no rank larger than 3. Table 5 proves, therefore, that x forms a
statistical type of three people with w and y and furthermore
forms no other type of three people from those of Table 3”
(McQuitty, 1965).


5 The italicized portions within the quotations were added in this paper.


TABLE 4
A Submatrix from Table 3

         w    x
    w    1    2
    x    3    1


TABLE 5
A Submatrix from Table 3

         w    x    y
    w    1    2    2
    x    3    1    3
    y    2    3    1

The analysis proceeds in this fashion until Column x is completed,
and then successively until every other column is completed
(McQuitty, 1963; McQuitty, 1965).
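
The conversion to ranks and the submatrix test can be sketched as follows. This is an illustrative sketch under stated assumptions, not the author's program: the function names are hypothetical, the data are the hypothetical associations of Table 2, and the diagonal entry for w, which is illegible in the source, is assumed to be 71 (any value above 65 gives the same ranks). A set of persons is accepted as a type when its submatrix of ranks contains no rank larger than the number of persons in it.

```python
LABELS = ["w", "x", "y", "z"]

# Hypothetical associations between people (Table 2).
# ASSOC[0][0] is garbled in the source; 71 is an assumed value.
ASSOC = [
    [71, 60, 65, 35],
    [60, 72, 50, 40],
    [65, 50, 73, 45],
    [35, 40, 45, 74],
]


def column_ranks(assoc):
    """Rank the entries within each column, 1 = highest (Table 3)."""
    n = len(assoc)
    ranks = [[0] * n for _ in range(n)]
    for j in range(n):
        column = [assoc[i][j] for i in range(n)]
        for i in range(n):
            # Rank = 1 + number of entries in the column that are larger.
            ranks[i][j] = 1 + sum(1 for v in column if v > column[i])
    return ranks


def is_type(ranks, members):
    """True if the submatrix of `members` holds no rank above its size."""
    size = len(members)
    return all(ranks[i][j] <= size for i in members for j in members)


def types_for_column(ranks, j):
    """Examine one column: test the 2, 3, ... persons most like person j."""
    n = len(ranks)
    order = sorted(range(n), key=lambda i: ranks[i][j])
    found = []
    for size in range(2, n):  # the full matrix qualifies trivially, so skip it
        members = order[:size]
        if is_type(ranks, members):
            found.append(sorted(members))
    return found


RANKS = column_ranks(ASSOC)
```

For Column x the pair (w, x) is rejected because its submatrix contains a rank of 3, while the triple (w, x, y) is accepted, which agrees with Tables 4 and 5.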


A Comparison of the Two Methods 


We will now compare the above two methods in some detail. I
have already proven (1964) that Linkage Analysis, even though 
based on a restricted definition of types, yields solutions similar to 
Rank Order Typal Analysis, which is based on a comprehensive 
definition of types. There are, however, some important differences 
as well as similarities. Specifically, I have shown that: 


1. Every type derived from Linkage Analysis contains one and
   only one reciprocal pair (a reciprocal pair is one in which
   i is most like j and j is in turn most like i).

2. Rank Order Typal Analysis results in types wherein each one
   contains at least one reciprocal pair, but not necessarily only
   one; there may be several reciprocal pairs in any one or more
   types.

3. Each reciprocal pair first yields a type of only the two
   members of the reciprocal pair, even in Rank Order Typal
   Analysis.

4. The first order types of Rank Order Typal Analysis are
   identical with those of Linkage Analysis when all of the persons
   of a matrix can be classified into types by the Rank Order
   Method.

5. Rank Order Typal Analysis can yield a hierarchical system
   of classification analogous to the Linnaean approach for the
   classification of plants and animals, as shown in Figure 4.

6. Linkage Analysis can be extended into a hierarchical method
   to yield types which contain more than one reciprocal pair.


The Impact of Data 


When we apply these two methods to data, the discouraging out- 
comes are: 


1. Rank Order Typal Analysis yields few types; many persons
   fail to enter types, and

2. Linkage Analysis yields “distant cousins”; some persons are
   far removed from the reciprocal pairs which initiate the
   types.

In other words, the comprehensive, statistical definition yields 
few empirical types, and the restricted definition yields some very 
loosely-knit types. 

We have at least two options. We can reject our theory of types
or we can assume that our data are in some way fallible and need
to be improved if they are to yield statistical types.

Data can be fallible in at least three ways: 


1. Assessments can be invalid; data can report that some persons
   possess characteristics which they do not, in fact, possess,
   or conversely, that they do not possess characteristics
   which they do, in fact, possess;

2. Persons may be classified differently in different categories
   of items; for example, items appropriate to the intellectual
   classification of some people may be irrelevant to the
   intellectual classification of other people. (More specifically,
   an item may have become emotionally toned for a given individual
   and may for him assess emotional characteristics rather
   than intellectual typology), and

3. Persons themselves may be invalid representatives of types;
   they may approach but never represent well any particular
   type.


A Revision of Assumptions


These considerations about the fallibility of data led to the
development of additional assumptions:


“1. Every person is an ‘imperfect’ type as distinct from a ‘pure’
    type; only ‘imperfect’ types exist in reality, and ‘pure’ types
    exist only in theory.

2. There are fewer ‘pure’ types than ‘imperfect’ types; each
   ‘pure’ type is represented in reality by two or more ‘imperfect’
   types.

3. The characteristics of ‘pure’ types are approached but never
   quite realized by classifying ‘imperfect’ types into
   internally-consistent categories, and by determining their common
   characteristics. The validity of representation of a ‘pure’ type
   generally increases as the number of ‘imperfect’ types
   representing it increases.

4. ‘Hierarchical’ types include all of the types realized in
   classifying ‘imperfect’ types into larger and larger,
   internally-consistent categories; they are the types intermediate
   between those of reality and theory, ‘imperfect’ and ‘pure’”
   (McQuitty, 1966b).


Hierarchical Classification by Reciprocal Pairs


We need a method of pattern analysis which will select those 
test items that best depict the statistical types. 

Linear models for psychological assessment have some sugges- 
tions to offer for improving data as they are being analyzed. 

“Linear models for the measurement of individual differences
are usually very exacting in their requirements; they specify the
model as linear and data are required to conform to it within
certain specified limits. Consequently the model is helpful in the
selection of test items. Many techniques of item analysis have
been developed for selecting sets of items which conform to linear
models.

“Pattern-analytic methods need analogous techniques to assist
them in the selection of items; they need techniques which will
select items that conform to a pattern-analytic model” (McQuitty,
1965).

The techniques must be precise in the selection process. “This
principle should not, however, conflict with another desirable
characteristic of pattern-analytic methods. Pattern-analytic methods
should not require the model to specify the kinds of
interrelationships reflected by the data; they should not, for
example, superimpose linear relationships when the predominant
patterns are either curvilinear or disjunctive” (McQuitty, 1965).

Hierarchical Classification by Reciprocal Pairs was designed to 
meet these requirements. The method selects pairs of persons which 
are indicative of types and improves them by incorporating addi- 
tional persons into the types and by eliminating characteristics 
which are irrelevant to the descriptions of the types. 

An Illustration. The method is illustrated by analyzing Table
6, which shows agreement scores (Zubin, 1938) between eight
companies in their union-management relations as assessed by 32
dichotomized variables.


TABLE 6
Agreement Scores between Companies

         A    B    C    D    E    F    G    H
    A   --   29   16   16   14    6   11    7
    B   29   --   17   17   13    6    8   10
    C   16   17   --   26   10    8    9   13
    D   16   17   26   --   10   12   11   11
    E   14   13   10   10   --   21   17   13
    F    6    6    8   12   21   --   19   17
    G   11    8    9   11   17   19   --   24
    H    7   10   13   11   13   17   24   --


Note—Data from McQuitty, 1954 (b). 


"Two companies agree on a variable if they are either both above 
or below the median, but not if one is above and the other is below. 

"The entries for the reciprocal pairs are underlined; they are 
AB, CD, EF, GH; two construction companies, two trucking
companies, a grain processing and a metal products company, and
two garment manufacturing companies, respectively. The two
companies, A and B, for example, are reciprocal because A is highest
with B, and B is in turn highest with A.

"As the entry between A and B shows these two companies have 
common answers on 29 of the 32 variables; they disagree on three 
variables" (McQuitty, 1966b). 
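
The agreement score just described can be sketched in a few lines. This is a hedged illustration: the function names are hypothetical, the small data vectors are invented for the example, and the treatment of a value exactly at the median (coded here as "not above") is an assumption, since the text does not say how such values are handled.

```python
from statistics import median


def dichotomize(values):
    """Code each company 1 if above the median of the variable, else 0.

    Values exactly at the median are coded 0 (an assumption of this sketch).
    """
    cut = median(values)
    return [1 if v > cut else 0 for v in values]


def agreement_score(a, b):
    """Count the variables on which two dichotomized profiles agree."""
    if len(a) != len(b):
        raise ValueError("profiles must cover the same variables")
    return sum(1 for x, y in zip(a, b) if x == y)


# Invented example: one raw variable measured on four companies,
# and two invented dichotomized profiles over four variables.
coded = dichotomize([3, 9, 5, 7])          # median is 6.0
score = agreement_score([1, 0, 1, 1], [1, 1, 1, 0])
```

With 32 variables, a score of 29 between A and B leaves 32 - 29 = 3 variables on which the two companies disagree.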

In the next step of the analysis, each “hierarchical” type, such
as AB, is assessed exclusively on the items on which its two
members agree. “The other three items were discarded as irrelevant to
the description of this type. Analogously Types AB, CD, EF, and
GH were assessed on 29, 26, 21, and 24 items respectively in the
next stage; the other items on which the members of the types did
not agree were discarded.

“The next step was to prepare a matrix (Table 7) reporting the 
agreements between Hierarchical Types AB, CD, EF, and GH. In 
order for two types such as AB and CD to agree on an item both 
types must be assessed on that item and all four companies of the 
two types would have to have the same answer on the item, all 
either above or below the median but not some above and others 
below. 
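
The rule for agreement between two hierarchical types can be sketched as follows. The sketch is illustrative only: the function names and the five-item profiles are hypothetical, invented to show the rule that an item counts only if both types are still assessed on it and all members of both types give the same answer.

```python
def type_profile(profiles, members):
    """Items on which all members agree, mapped to the shared answer.

    Items on which the members disagree are discarded for this type.
    """
    first = profiles[members[0]]
    return {
        k: first[k]
        for k in range(len(first))
        if all(profiles[m][k] == first[k] for m in members)
    }


def type_agreement(profile_a, profile_b):
    """Number of items retained by both types with the same answer."""
    return sum(1 for k, v in profile_a.items() if profile_b.get(k) == v)


# Invented dichotomized answers (1 = above median, 0 = below) on five items.
PROFILES = {
    "A": [1, 1, 0, 0, 1],
    "B": [1, 1, 0, 1, 1],
    "C": [1, 0, 0, 0, 1],
    "D": [1, 0, 0, 0, 0],
}

AB = type_profile(PROFILES, ["A", "B"])   # item 3 is discarded
CD = type_profile(PROFILES, ["C", "D"])   # item 4 is discarded
AB_CD = type_agreement(AB, CD)
```

In this invented case AB and CD agree on two items: item 1 is retained by both but answered differently, and items 3 and 4 have each been discarded by one of the types.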


TABLE 7
Agreement Scores between Hierarchical Types

          AB   CD   EF   GH
    AB    --   13    4    5
    CD    13   --    4    6
    EF     4    4   --   10
    GH     5    6   10   --


“The entries for the reciprocal pairs, underlined, show that
Hierarchical Types AB and CD join to form a higher level
Hierarchical Type ABCD, and likewise EF and GH combine to form
EFGH, to yield the results shown in Figure 5.


                      ABCD-13*                 EFGH-10*
                      4×13=52**                4×10=40**
Hierarchical Types    AB-29*      CD-26*       EF-21*      GH-24*
                      2×29=58**   2×26=52**    2×21=42**   2×24=48**
Individual Types      A    B      C    D       E    F      G    H


Figure 5. A Hierarchical Classification by Reciprocal Pairs
*  Number of common characteristics
** Classification capacity
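
The classification capacities in Figure 5 appear to be the number of members of a type multiplied by the number of characteristics its members have in common. The short sketch below reproduces that arithmetic; the function name is hypothetical, and the count of 10 common characteristics for Type EFGH is taken from Tables 7 and 16 on the assumption that the figure reports the same value.

```python
def classification_capacity(n_members, n_common):
    """Members of the type times the characteristics they share."""
    return n_members * n_common


# (number of members, number of common characteristics) for each type.
TYPE_SIZES = {
    "AB": (2, 29),
    "CD": (2, 26),
    "EF": (2, 21),
    "GH": (2, 24),
    "ABCD": (4, 13),
    "EFGH": (4, 10),  # assumed from Tables 7 and 16
}

CAPACITIES = {
    name: classification_capacity(members, common)
    for name, (members, common) in TYPE_SIZES.items()
}
```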


“It is a characteristic of the data and not the method that the
classification is by twos. The method requires only that the
reciprocal pairs of the first matrix be by twos. Other ‘hierarchical’
types can contain any number of members from three up; a
‘hierarchical’ type of two members from the first matrix, for example,
could join a single ‘imperfect’ type of the second matrix.

“An Interpretation. The example illustrates how the method
conforms to its design and improves on the validity of the statistical
types as it proceeds. This fact is illustrated with respect to Type
EFGH. In Rank Order Typal Analysis, this category of four
companies failed to qualify as a type (McQuitty, 1963). In the
present study, on the other hand, it did qualify. It did so because each
Type EF and GH was improved as representative of Type EFGH.
The Types EF and GH were improved by eliminating from their
respective descriptions items on which their two respective
members did not agree” (McQuitty, 1966b).


Hierarchical Classification by Rank Order Types 


Reciprocal pairs, the key concept in the above method, are in- 
ternally consistent categories of only two representatives; each 
representative of every pair is more like the other representative of 
the pair than it is like any representative of any other category. 

Internally consistent categories can, however, be composed of
more than two representatives, in which instance every
representative of every category is more like every other
representative of that category than it is like any representative
of any other category. They are therefore presumed to be more
valid. Hierarchical Classification by Rank Order Typal Analysis
isolates all the internally consistent categories of every matrix,
irrespective of the size of the categories.

There is but one exception to this principle. The method will not 
accept Category X of n representatives as satisfying the definition 
of internal consistency if X includes any subset of representatives 
which does not satisfy the definition of internal consistency. In 
other words, the categories must not only be internally consistent, 
but, in addition, all the sub-categories which were considered in 
building them must also be internally consistent. If this were not 
true, the method could be extremely laborious in classifying some 
sets of data (McQuitty, 1966b). Hierarchical Classification by 
Rank Order Types proceeds in the same fashion as by Reciprocal 
Pairs except that it uses internally consistent categories of any 
size, not just pairs.

Differential Validity of Items. When the above methods are
applied to the isolation of statistically defined types, they both
reject and retain items. The several types of a single study
discard and retain various combinations of items. Consequently, it
would be possible, theoretically, for eight types to discard and
retain all possible combinations of Items a, b, and c, as shown in
Table 8. The items could then be said to have differential validity.


TABLE 8
Differential Rejection and Retention of Items by Eight Types; Hypothetical Data

    Types    S    T    U    V    W    X    Y    Z

    Items
      a      X    X    O    O    X    X    O    O
      b      X    O    X    O    X    O    X    O
      c      X    O    O    X    O    X    X    O

    X: retained
    O: rejected

This means, for example, that the endorsement of Item c is indica- 
tive of Type X or Y, depending on the combination of other items 
which is endorsed along with it. 
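
That eight types exhaust the possible combinations of three items follows from 2 × 2 × 2 = 8. The sketch below encodes Table 8 and checks this; the names `TABLE_8` and `types_retaining` are hypothetical and introduced only for the illustration.

```python
from itertools import product

ITEMS = ("a", "b", "c")

# Table 8: for each type, whether Items a, b, c are retained (True) or rejected.
TABLE_8 = {
    "S": (True, True, True),
    "T": (True, False, False),
    "U": (False, True, False),
    "V": (False, False, True),
    "W": (True, True, False),
    "X": (True, False, True),
    "Y": (False, True, True),
    "Z": (False, False, False),
}

# Every possible retain/reject pattern over three items.
ALL_PATTERNS = set(product((True, False), repeat=len(ITEMS)))


def types_retaining(table, item):
    """Types whose description still includes the given item."""
    k = ITEMS.index(item)
    return sorted(name for name, flags in table.items() if flags[k])


COVERS_ALL = set(TABLE_8.values()) == ALL_PATTERNS
RETAIN_C = types_retaining(TABLE_8, "c")
```

Item c is retained by four of the eight types, so its endorsement alone does not identify a type; the accompanying pattern of other items is needed.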

Differential validity is a central concept here. It can be con- 
trasted with invariant validity, where an item is required to assess 
the same variables on all people, irrespective of who endorses it 
and the combinations of endorsements of other items with which 
it occurs (McQuitty, 1959). 

If an item endorsement is required to assess the same variables 
across all categories of people, it will generally reflect considerably 
less than perfect validity. If, on the other hand, an item endorse- 
ment is allowed to assess different attributes in various categories 
of people, the endorsement will presumably increase in total valid- 
ity. 

Items with high invariant validity can not be improved much
by methods which allow them to assess different attributes in
various categories of people. Items need to have low invariant
validity in order to be improved greatly by differential validity.

An initial indicant of the possibility of improved validity is the
occurrence of endorsements in various combinations. The problem
is to isolate those combinations which maximize total validity.

An Elaboration of Assumptions. Additional assumptions are
required to develop methods for meaningful interpretations of the
differential retention and rejection of items by the various
statistically defined types. We are not satisfied to conclude that
rejected items solicit only chance endorsement by the members of
the types which reject them.


36 EDUCATIONAL AND PSYCHOLOGICAL MEASUREMENT


Various sets of pressures act on a child as he grows up. One 
set of pressures might be from the family as a whole, or from 
pressures within the family; the mother and father might each 
represent a separate set of pressures. Other examples of sets of 
pressures are those of the church, the school, and peer groups. 

We assume that every child tends to develop into several types 
as a consequence of the operations of multiple sets of pressures. 
Every child has a different classification in each of the several 
typologies. 

Every type to which a child belongs is probably somewhat in- 
complete and immature, but some types are more predominant 
than others. Each type tends to be somewhat obscured by the 
presence of other types. 

If the types are to be separated and displayed, then methods for 
the isolation of multiple types within individuals are needed. 

The simpler approach, in a purely numerical sense, is to assume 
that the types are independent. On the other hand, given the 
operation of sets of pressures, we accept the possibility of
overlapping and intersecting types.

For the purpose of developing statistical methods, two types are 
independent if they have no common characteristics, but they are 
intersecting if they have one or more common characteristics. 
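
This working definition amounts to a test on the sets of characteristics of two types. A minimal sketch, with a hypothetical function name and invented item numbers:

```python
def relation(characteristics_a, characteristics_b):
    """Classify two types by whether their characteristics overlap."""
    shared = set(characteristics_a) & set(characteristics_b)
    return "intersecting" if shared else "independent"


# Invented item numbers standing for the characteristics of each type.
first = relation({1, 4, 7}, {2, 5, 8})    # no common characteristic
second = relation({1, 4, 7}, {4, 9})      # item 4 is common
```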


Multiple Rank Order Typal Analysis 


Rank Order Typal Analysis can be elaborated to isolate both 
independent and intersecting types. To isolate intersecting types, 
assume that a matrix has been analyzed to yield statistical types:
T1, T2, T3, ..., Tn. A set of characteristics is associated with each
type: it is the set of characteristics on which all members of the
type agree. In order for a person to be a member of a statistical
type, he must possess all the statistical characteristics of that
type.

Sub-types. Some persons may be sub-members of a type; they
possess fewer than all the characteristics of a type, and in their
predominant pattern of characteristics they are members of other
types. Because we need to isolate sub-types as well as types, we
must specify a statistical definition of a sub-type. A sub-type is
defined in terms of the characteristics of a type, exclusively. The
sub-type must possess fewer than all the characteristics of the
type, but every member of a sub-type must be more like every
other member of that sub-type than he is like any other person.

This definition dictates the direction of elaboration in method. 
It says: analyze a matrix to determine first the members of a 
type and the items which define it. Using only these items, com- 
pute another matrix of interassociation, using all the people of the 
original matrix, and analyze this new matrix. This second analysis 
will do two things: it will reproduce the original type, and it will 
yield whatever sub-types are definable in the items of the original 
type. 
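
The second analysis begins from a matrix recomputed on the defining items of one type. The sketch below shows only that recomputation, under stated assumptions: the function names and the four invented six-item profiles are hypothetical, and the resulting matrix would still have to be submitted to Rank Order Typal Analysis.

```python
def defining_items(profiles, members):
    """Indices of the items on which all members of the type agree."""
    n_items = len(profiles[members[0]])
    return [
        k for k in range(n_items)
        if len({profiles[m][k] for m in members}) == 1
    ]


def agreement_matrix(profiles, items):
    """Agreement scores between all persons, using only `items`."""
    return {
        a: {
            b: sum(1 for k in items if profiles[a][k] == profiles[b][k])
            for b in profiles
        }
        for a in profiles
    }


# Invented dichotomized profiles of four persons on six items.
PROFILES = {
    "p": [1, 1, 0, 1, 0, 0],
    "q": [1, 1, 0, 1, 1, 1],
    "r": [1, 1, 0, 0, 1, 0],
    "s": [0, 0, 1, 0, 1, 1],
}

ITEMS_PQ = defining_items(PROFILES, ["p", "q"])   # items defining type pq
MATRIX_PQ = agreement_matrix(PROFILES, ITEMS_PQ)  # all persons, those items only
```

The members of the original type necessarily agree on every retained item, so the type is reproduced; a person such as r, who agrees on most but not all of those items, is a candidate sub-member.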

Illustrating the Isolation of Intersecting Types. The method will 
now be illustrated by applying it to the data of Table 6, which re- 
ports agreement scores between eight industrial companies, in 
union-management relationships. 

The first step was to apply Rank Order Typal Analysis to the 
data of Table 6 to yield the initial types. The results of this 
analysis are shown in Figure 6. 


                      ABCD 13*
Hierarchical Types    AB 29*     CD 26*     EF 21*     GH 24*
Individuals           A   B      C   D      E   F      G   H


Figure 6. A Hierarchical Classification by Rank Order Typal Analysis of Eight
Companies, All 32 Variables
* Number of common characteristics


There are five types, each with a set of items. For example, the 
two Companies A and B of Type AB agree on 29 items. Using 
only these items, a new matrix of agreement scores between com- 
panies was computed, as shown in Table 9. 


TABLE 9
Agreement Scores between the Companies on the 29 Items of Type AB

         A    B    C    D    E    F    G    H
    A   29   29   15   15   12    5    8    7
    B   29   29   15   15   12    5    8    7
    C   15   15   29   25    8    8    8   11
    D   15   15   25   29   10   10   10    9
    E   12   12    8   10   29   20   15   12
    F    5    5    8   10   20   29    ?   16
    G    8    8    8   10   15    ?   29   24
    H    7    7   11    9   12   16   24   29
In a similar fashion a matrix was computed for the items of
each of the other four types, CD, ABCD, EF, and GH, and they
are shown in Tables 10, 11, 12, and 13. “The Rank Order Typal
Analyses of Tables 9 through 13, inclusive, yielded in addition to
the types already isolated, two new Types, FGH and EFGH. The
matrices for the items common to the members of each of these
two types are shown in Tables 14 and 15. Each of them was
analyzed by Rank Order Typal Analysis.


TABLE 10
Agreement Scores between the Companies on the 26 Items of Type CD

         A    B    C    D    E    F    G    H
    A   26   25   13   13   10    5    8    6
    B   25   26   14   14    9    4    7    7
    C   13   14   26   26    7    7    7    9
    D   13   14   26   26    7    7    7    9
    E   10    9    7    7   26   19   14   12
    F    5    4    7    7   19   26   14   14
    G    8    7    7    7   14   14   26   22
    H    6    7    9    9   12   14   22   26


TABLE 11
Agreement Scores between the Companies on the 13 Items of Type ABCD

         A    B    C    D    E    F    G    H
    A   13   13   13   13    2    0    1    1
    B   13   13   13   13    2    0    1    1
    C   13   13   13   13    2    0    1    1
    D   13   13   13   13    2    0    1    1
    E    2    2    2    2   13   11   10   10
    F    0    0    0    0   11   13   12   12
    G    1    1    1    1   10   12   13   13
    H    1    1    1    1   10   12   13   13

TABLE 12
Agreement Scores between the Companies on the 21 Items of Type EF

         A    B    C    D    E    F    G    H
    A   21   20   12   12    5    5    ?    ?
    B   20   21   13   13    4    4    ?    7
    C   12   13   21   19    4    4    3    7
    D   12   13   19   21    6    6    6    5
    E    5    4    4    6   21   21   13   10
    F    5    4    4    6   21   21   13   10
    G    ?    ?    3    6   13   13   21   18
    H    ?    7    7    5   10   10   18   21

TABLE 13
Agreement Scores between the Companies on the 24 Items of Type GH

         A    B    C    D    E    F    G    H
    A   24   24   14   14    8    4    5    5
    B   24   24   14   14    8    4    5    5
    C   14   14   24   22    8    5    7    7
    D   14   14   22   24    8    5    7    7
    E    8    8    8    8   24   18   11   11
    F    4    4    5    5   18   24   14   14
    G    5    5    7    7   11   14   24   24
    H    5    5    7    7   11   14   24   24

Note: Rank Order Typal Analysis was applied separately to Tables 10 through 13.

TABLE 14
Agreement Scores between the Companies on the 14 Items of Type FGH

         A    B    C    D    E    F    G    H
    A   14   14   13   13    4    0    0    0
    B   14   14   13   13    4    0    0    0
    C   13   13   14   12    3    1    1    1
    D   13   13   12   14    2    1    1    1
    E    4    4    3    2   14   10   10   10
    F    0    0    1    1   10   14   14   14
    G    0    0    1    1   10   14   14   14
    H    0    0    1    1   10   14   14   14

TABLE 15
Agreement Scores between the Companies on the 10 Items of Type EFGH

         A    B    C    D    E    F    G    H
    A   10   10   10   10    0    0    0    0
    B   10   10   10   10    0    0    0    0
    C   10   10   10   10    0    0    0    0
    D   10   10   10   10    0    0    0    0
    E    0    0    0    0   10   10   10   10
    F    0    0    0    0   10   10   10   10
    G    0    0    0    0   10   10   10   10
    H    0    0    0    0   10   10   10   10


Results 


“Table 6 shows the original indices of association between the
companies. Tables 9 to 15, inclusive, show analogous indices where
only items common to the members of a type were used in
computing the interassociations. A simple observation reveals that the
latter tables reflect statistical types more clearly than the original
table, thus indicating that the method of item selection is effective
in picking out the items indicative of types.

“The results from Multiple Rank Order Typal Analysis for the
Isolation of Intersecting Types are shown in Table 16. Starting
with the items common to the two companies of the construction
type, Companies AB, the reapplication of Rank Order Typal
Analysis yielded Types AB (construction), CD (trucking), EF (grain
processing and metal products), GH (garments), and ABCD
(construction and trucking). Each of the other three dyadic starts
(Types CD, EF, and GH) yielded not only these same types (as
did starting with Type AB) but in addition they produced Type
EFGH (grain processing, metal products, and garments). Starting
with Type ABCD yielded, of course, itself, failed (necessarily) to
yield Types AB and CD, agreed with the others in yielding Type
GH, but failed to yield Type EF. Instead it produced Type FGH
and then Type EFGH.


TABLE 16
Results from Extended Rank Order Typal Analysis

Using         Number      Types Obtained and the Number of Items on
items of      of items    Which Members Agree

Type AB         29        AB-29   CD-25   ABCD-13   EF-20   GH-24
Type CD         26        AB-25   CD-26   ABCD-13   EF-19   GH-22   EFGH-10
Type EF         21        AB-20   CD-19   ABCD-10   EF-21   GH-18   EFGH-10
Type GH         24        AB-24   CD-22   ABCD-13   EF-18   GH-24   EFGH-10
Type ABCD       13        ABCD-13   GH-13   FGH-12   EFGH-10
Type FGH        14        ABCD-12   FGH-14   EFGH-10
Type EFGH       10        ABCD-10   EFGH-10


“Starting with Type FGH yielded Types ABCD, FGH, and 
EFGH, and starting with Type EFGH produced Types ABCD 
and EFGH” (McQuitty, 1965). 

The appearance of Type EFGH in six analyses but not in the 
first one is related to other findings. When the data were forced into 
types (without satisfying a criterion of internal consistency as re- 
quired in Rank Order Typal Analysis), Type EFGH invariably 
appeared even though all items were used (McQuitty, 1954b, 1960,
for example). On the other hand, when our criterion of internal 
consistency was applied to the individual companies and all the 
items, Companies E, F, G, and H did not combine and qualify as 
a type. They qualify only after certain items have been withdrawn 
in the process of improving typological descriptions.

There is another way in which Type EFGH can be obtained
with a criterion of internal consistency. Rank Order Analysis by 
Reciprocal Pairs applies a criterion of internal consistency and 
produces this type as shown in Figure 5. This method first cate- 
gorizes E and F into one type and G and H into another type. It 
then defines each of these types by the characteristics common to 
the members of each type. These two types, EF and GH, are in- 
ternally consistent, as shown in Table 7; Type EF is most like 
Type GH and Type GH is most like Type EF. 

Type EFGH appears only when the data have been purified by 
the elimination of irrelevant items. 

"The results illustrate two points: (a) the items used determine, 
of course, the types, and (b) there are various meaningful points 
of departure for the selection of items in a typological study. One 
point of departure, for example, is to select a set of items designed 
to be representative of union-management relations in general. 
These items can then be analyzed as reported in this paper to in- 
dicate that different sets of items are associated with different in- 
dustries. Each set can be applied separately to investigate how 
each industrial type is intersected by other industrial types. 
Another point of departure is to select for analysis sets of items 
which are characteristic of various industries, and then purify 
them statistically. 

The Isolation of Independent Typologies. Intersecting typologies 
can now be compared with independent typologies in describing 
how the independent statistical typologies are isolated. In both in- 
stances a person can theoretically hold membership in two or more 
classifications. In intersecting typologies, his classification in two
typologies is based on some common characteristics, but in
independent typologies there must be no overlap of characteristics for
any one person in two or more classifications.

The first step of the analysis is the same for the two approaches.
Both apply a method such as Rank Order Typal Analysis to a
matrix of interassociations (Table 6, for example) to yield the
first-order typology.

In the independent approach, the next step is to decide the
termination point of the first typology. The termination point can be
specified by means of inclusive types, first-order types, statistically
significant types, or perhaps some other concept.

The inclusive types are the minimal number of types necessary
to classify every individual. In Figure 6, for example, the inclusive 
types are (1) ABCD, (2) EF, and (3) GH. The first-order types, 
on the other hand, are (1) AB, (2) CD, (3) EF, and (4) GH. 

The method is illustrated with inclusive types. For each inclu- 
sive type, the characteristics common to the members of that type 
are specified and then they are subtracted from the original de- 
scription of each of the individual companies of the type. For ex- 
ample, Companies A, B, C, and D formed a type and agreed on 
13 of the 32 original, dichotomous items. The subtraction left each 
of these four companies described in only 19 items. As a result, 
each of these companies could no longer agree with any company
on the items common to Type ABCD.

“Analogously, the common characteristics were selected out of
the description of the two companies for each of the other two
inclusive types, Types EF and GH.

“A new matrix of agreement scores was prepared for all of the 8
companies using only the items which remained for each individual
company.

“A Rank Order Typal Analysis was performed on this matrix.
It yielded Types AB and CD; no other combinations qualified as
types. These Types, AB and CD, had already appeared on the
way to yielding Type ABCD.

“The method proceeds with Types AB and CD, just as it would
with any other ‘inclusive’ types (as defined above); the common
characteristics of each Type AB and Type CD were subtracted
from each A and B, on the one hand, and C and D, on the other.

“A new matrix of agreement scores was formed, just the same as
the last one except for the omissions of items in Companies A, B,
C, and D as already specified.

“The last matrix was analyzed by Rank Order Typal Analysis
and failed to yield any types. The analysis was thus completed”
(McQuitty, 1966a). The first two analyses yielded the results
shown in Figures 7 and 8.


[Figure 7. First Analysis. Tree diagram of individual companies A through H grouped into types, with the number of agreements for each type.]

[Figure 8. Second Analysis. Tree diagram of individual companies A through D grouped into Types AB and CD, with the number of agreements for each type.]


“Figure 7 reports the exact number of characteristics that the 
members of each type have in common. Furthermore, it reveals the 
number that the members of each AB and CD have in common 
over and above those which all four companies have in common as 
members of Type ABCD; they are 16 and 13, respectively (29 - 13 = 16
for Type AB and 26 - 13 = 13 for Type CD). This latter
fact is reported also in Figure 8, but Figure 8 adds information 
still not revealed by Figure 7: it shows that the data of this 
study reveal only one system of independent types, in terms of the 
stringent definition of types used in the above analysis”
(McQuitty, 1966a).
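
The subtraction step described above can be sketched in code. This is only a minimal illustration under stated assumptions, not McQuitty's program: each company is a row of dichotomous items, the agreement score of two companies is taken to be the number of items (still retained by both) on which they give the same response, and the function names and toy data are hypothetical.

```python
import numpy as np

def agreement_matrix(items, mask):
    """Count, for each pair of rows, the items on which they agree,
    using only items still retained (mask True) by both rows."""
    n = items.shape[0]
    agree = np.zeros((n, n), dtype=int)
    for a in range(n):
        for b in range(n):
            if a != b:
                both = mask[a] & mask[b]
                agree[a, b] = int(np.sum(both & (items[a] == items[b])))
    return agree

def subtract_common(items, mask, members):
    """Remove from each member's description the items on which
    all members of the type agree. Returns the new mask and the
    number of common items removed."""
    sub = items[members]
    common = np.all(sub == sub[0], axis=0)
    common &= np.all(mask[members], axis=0)  # only items every member still has
    new_mask = mask.copy()
    new_mask[np.ix_(members, np.flatnonzero(common))] = False
    return new_mask, int(common.sum())

# Toy data (hypothetical): three companies, five dichotomous items.
items = np.array([[1, 1, 1, 0, 0],
                  [1, 1, 0, 0, 1],
                  [1, 0, 1, 1, 0]])
mask = np.ones(items.shape, dtype=bool)

before = agreement_matrix(items, mask)
mask_after, n_common = subtract_common(items, mask, [0, 1, 2])
after = agreement_matrix(items, mask_after)
print(n_common, before[0, 1], after[0, 1])
```

In this sketch the agreement between two members of a type falls by exactly the number of items common to the whole type, which mirrors the subtraction reported in the text.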


Independent Versus Intersecting Types


It is helpful to contrast the above analysis in search of independent
types with our search for intersecting types. The search for
intersecting types yielded the typology of Figure 9. 


[Figure 9. A Typology from a Search for Intersecting Types. The legend distinguishes predominant types and sub-types from sub-types only.]


The predominant types of Figure 9 are the same as the in- 
dependent types of Figure 7, first analysis. They were obtained 
if we used all the items of the original matrix, Table 6. They were 
obtained again, however, if we used only the 29 items of Type AB.
Consequently, they must be classified as sub-types in this latter
instance. The other types shown are all sub-types; they were derived
from the items descriptive of other types.

“The present set of data reflects two or more similar configurations
of intersecting typologies but only one system of independent
types. The substantive implications of this very limited study are 
that various pressures do in fact act on institutions to yield more 
than one typology and that the pressures interact to yield inter- 
secting rather than independent types” (McQuitty, 1966a). 

The substantive implications of this study need to be investi- 
gated further on much more comprehensive sets of data for people, 
institutions, and other objects in many cultures, including, of 
course, cross-validational studies. The contributions of this and 
related papers are in the methodologies which they provide for 
additional substantive studies. 


Methods for Continuous Data 


Except for Linkage Analysis, the methods thus far outlined ap- 
ply to discrete data. The theories which generated the methods 
for discrete data can also generate methods appropriate to continuous
data. For this purpose it is helpful to summarize the theory
which we have been using in developing the methods. 

A Review of Theory. “The theory says that every individual
represents a succession of types, first an individual type, then types
analogous to a species, a genus, a family, etc. As more and more
individual types are classified together to represent higher and
higher orders of hierarchical types, the successive categories become
better representatives of pure types, which exist only in
theory.

“Individual and hierarchical categories of the above kind are
jointly characterized as typal representatives.

“Every typal representative at any level x is best classified at
the next higher level if it is classified with the typal representative
most like it at level x and if the two representatives are reciprocal,
i.e., Typal Representative i is most like j, and j is in turn most
like i” (McQuitty, 1966c).


A Basic Equation 


This theory can be used to generate a basic equation which can
be applied repeatedly to complete hierarchical classification of persons
based on their interassociations as reported in a matrix:




The Basic Equation

“Let:

ij = any typal representative formed by combining the two
typal representatives i and j from the next lower level,
level x
k = any typal representative other than i and j from level x
$a_{ik}$ = an index of association between i and k
$a_{jk}$ = an index of association between j and k
$a_{ijk}$ = an index of association between ij and k.

“Then:

$$a_{ijk} = \frac{a_{ik} + a_{jk}}{2}$$” (McQuitty, 1966c)


This equation can be used to compute the index of association of 
any type at any level of classification with any other type at the 
next lower level of classification. A criterion of internally con- 
sistent categories can be used to isolate types in the original 
matrix. The above equation can be used to prepare the second and 
each succeeding matrix from its immediate predecessor. Thus, the 
analysis proceeds from one matrix to the next, even if the data are 
continuous. 
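
The step from one matrix to the next can be sketched as follows. This is a minimal illustration, assuming the basic equation takes the averaging form $a_{ijk} = (a_{ik} + a_{jk})/2$ and that types are first isolated as reciprocal pairs; the function names and the small matrix are hypothetical, and ties are broken by taking the first maximum.

```python
import numpy as np

def reciprocal_pairs(a):
    """Return pairs (i, j), i < j, where i is most like j and j is most like i."""
    m = np.array(a, dtype=float)
    np.fill_diagonal(m, -np.inf)          # ignore self-association
    best = m.argmax(axis=1)               # most similar other representative
    return [(i, int(j)) for i, j in enumerate(best) if best[j] == i and i < j]

def combine(a, i, j):
    """Build the next matrix after merging i and j into one representative.
    The association of the merged representative with each remaining k
    is assumed to be the mean of a[i, k] and a[j, k]."""
    a = np.asarray(a, dtype=float)
    keep = [k for k in range(a.shape[0]) if k != i and k != j]
    merged = (a[i, keep] + a[j, keep]) / 2.0
    n = len(keep) + 1
    out = np.ones((n, n))                 # diagonal entries are placeholders
    out[:-1, :-1] = a[np.ix_(keep, keep)]
    out[-1, :-1] = merged                 # merged representative is placed last
    out[:-1, -1] = merged
    return out, keep

# Hypothetical matrix of interassociations among four persons.
a = np.array([[1.0, 0.9, 0.2, 0.1],
              [0.9, 1.0, 0.4, 0.3],
              [0.2, 0.4, 1.0, 0.8],
              [0.1, 0.3, 0.8, 1.0]])

pairs = reciprocal_pairs(a)
next_matrix, kept = combine(a, *pairs[0])
print(pairs)
print(next_matrix)
```

Applying `reciprocal_pairs` and `combine` repeatedly until one representative remains would give a complete hierarchical classification under these assumptions.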


Summary 


In summary, various sets of pressures are assumed to act on 
individuals, institutions, and other objects as they develop. Consequently,
each object tends to develop more than one type, and a
sample of objects reflects more than one typology. The multiple 
existence of types obscures the types from direct observation. 

Quantitative methods are needed to help separate and display 
the various types and typologies. 

A key consideration in the methods of this paper is the technique
for isolating the items which best depict types. For this purpose,
we use the internally consistent submatrix, where every member
of the submatrix is more like every other member than like
any member of any other submatrix. A special example of this
submatrix is the reciprocal pair, where i is most like j and j is in
turn most like i.
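
The criterion of internal consistency can be written as a short check. The sketch below is an illustration only: it treats every object outside the candidate set as a member of some other submatrix, requires a strict inequality, and uses a hypothetical function name and matrix.

```python
import numpy as np

def is_internally_consistent(a, members):
    """True if every member is more like every other member
    than like any object outside the candidate set."""
    members = list(members)
    others = [k for k in range(a.shape[0]) if k not in members]
    if len(members) < 2:
        return False
    if not others:
        return True
    for m in members:
        least_inside = min(a[m, k] for k in members if k != m)
        most_outside = max(a[m, k] for k in others)
        if least_inside <= most_outside:
            return False
    return True

# Hypothetical matrix of interassociations among four objects.
assoc = np.array([[1.0, 0.9, 0.2, 0.1],
                  [0.9, 1.0, 0.4, 0.3],
                  [0.2, 0.4, 1.0, 0.8],
                  [0.1, 0.3, 0.8, 1.0]])

print(is_internally_consistent(assoc, [0, 1]))  # a reciprocal pair
print(is_internally_consistent(assoc, [1, 2]))  # not a reciprocal pair
```

For a candidate set of two objects the check reduces to the reciprocal-pair condition described in the text.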

If these internally consistent submatrices do in fact yield fruitful,
initial types, then they can be elaborated as shown to classify
people, institutions, and objects into meaningful multiple typologies.

A COMPARISON OF TWO METHODS OF CLUSTER
ANALYSIS 


MAURICE LORR 


Veterans Administration 
Washington, D. C. 
AND 
BELUR K. RADHAKRISHNAN 


George Washington University 


The researcher is often faced with the possibility that his
sample is not homogeneous but represents a mixture of several dis- 
tinct populations. The problem then, given a set of measurements 
on a sample of individuals, is to detect any mutually exclusive 
subgroups that may exist. A procedure is thus needed for identifying
the subgroups in a non-random sample of persons studied.
Formal models such as factor analysis (Stephenson, 1936) and 
latent profile analysis (Gibson, 1959) have been proposed as solu- 
tions. A variety of looser procedures labeled “cluster analysis” 
have also been suggested. Within this class fall methods proposed 
by Thorndike (1953), McQuitty (1964), Sawrey, Keller and Con- 
ger (1960), Saunders and Schucman (1962), Gengerelli (1963), 
and Lorr, Klett, and McNair (1963). 

No effort will be made here to review available cluster methods. 
The problem here is one of comparing two rather efficient pro- 
cedures for identifying homogeneous groups of score profiles. Both 
methods have been programmed for the 7090 and are applicable 
to the correlations or congruency coefficients among score profiles 
of 150 individuals. 


Method of Procedure 


Cluster Method A 


The first procedure called Method A was developed by Lorr and 
McNair (Lorr, Klett, and McNair, 1963). The clustering process 
begins by listing with each coded standard score profile the code
numbers of all other profiles that correlate at or above a lower
value $C_1$. The limit $C_1$ may be set at the value at which a coefficient
of correlation based on K independent variates is significant
at the 5 per cent level. The profile with the longest associated list is
selected as a pivot. To the pivot is added the profile with the
highest average correlation with all members in the pivot list. To
the first pair is added the profile that correlates highest on the


This procedure continues until there are no other profiles whose
mean correlation with the cluster equals or exceeds $C_1$.

Next an upper limit $C_2$, which defines dissimilarity, is set to prevent
cluster overlap. A suitable value is the point where a coefficient
is significant at P < .10. The procedure is to delete any
profile in the residual matrix that correlates on the average $C_2$ or
higher with the first cluster. These profiles are not considered for


Second cluster are deleted from the matrix. The clustering process
is continued until no cluster with at least four members can be

Cluster Method B


The second method employs the multiple group factor method
and a weight matrix (Guttman, 1952) without resort to the computation
of residual matrices. The first step is to find a profile near
the center of a cluster. This is accomplished by determining the
variance of each person’s squared correlations with all others
(Tryon, 1958). The profile with the maximum variance is selected
as a pivot. The squaring process serves to magnify high correlations
and to minimize low correlations. Then a weight vector $w_1$
is formed that consists of the cubed correlations of the pivot with
all other profiles, except that the element corresponding to the
matrix diagonal is replaced by the largest absolute value in $w_1$.
Thus

$$w_1' = [\, r_{j1}^3 \;\; r_{j2}^3 \;\; \cdots \;\; r_{jn}^3 \,] \qquad (1)$$

in which $r_{jk}$ is the correlation coefficient between the scores of persons
j and k.

The correlation matrix $R_1$ is then premultiplied by the vector $w_1'$
to yield the 1 by n matrix $S_1'$. Each element in $S_1'$ is next divided by
the square root of $w_1' R_1 w_1$ to yield

$$T_1' = \frac{S_1'}{\sqrt{w_1' R_1 w_1}} \qquad (2)$$

the vector of profile correlations with the first type factor. Profiles
correlating at or above lower limit $C_1$ are designated members of
the first cluster. Profiles that correlate between $C_1$ (lower limit)
and $C_2$ (upper limit) with $T_1$ are deleted from the matrix and are
not considered for inclusion in succeeding clusters. This process, as
in Method A, serves to prevent cluster overlap.
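
The first pass of Method B can be sketched numerically. The code below is an illustration under assumptions rather than the authors' 7090 program: the pivot is the profile whose squared correlations with the others have the largest variance, the weight vector holds the cubed pivot correlations with the diagonal element replaced by the largest absolute off-diagonal weight, and loadings are computed as the weighted sums divided by the square root of w'Rw. All function names and the simulated profiles are hypothetical; the limits .74 and .63 are those reported for Method B.

```python
import numpy as np

def pivot_profile(R):
    """Index of the profile with maximal variance of squared correlations
    with all other profiles (diagonal excluded)."""
    n = R.shape[0]
    off_diag = ~np.eye(n, dtype=bool)
    sq = R ** 2
    variances = np.array([sq[j, off_diag[j]].var() for j in range(n)])
    return int(variances.argmax())

def type_factor_loadings(R, pivot):
    """Weight vector w and loadings T = (w'R) / sqrt(w'Rw)."""
    w = R[pivot] ** 3                                 # cubed correlations with the pivot
    w[pivot] = np.abs(np.delete(w, pivot)).max()      # replace the diagonal element
    S = w @ R
    T = S / np.sqrt(w @ R @ w)
    return w, T

def split_by_limits(T, c1, c2):
    """Members load at or above c1; profiles between c2 and c1 are deleted."""
    members = np.flatnonzero(T >= c1)
    deleted = np.flatnonzero((T >= c2) & (T < c1))
    return members, deleted

# Simulated data (hypothetical): 8 profiles on 10 scores, two prototypes plus noise.
rng = np.random.default_rng(0)
prototypes = rng.normal(size=(2, 10))
profiles = np.vstack([prototypes[0] + 0.3 * rng.normal(size=(4, 10)),
                      prototypes[1] + 0.3 * rng.normal(size=(4, 10))])
R = np.corrcoef(profiles)                             # person-by-person correlations

pivot = pivot_profile(R)
w, T = type_factor_loadings(R, pivot)
members, deleted = split_by_limits(T, c1=0.74, c2=0.63)
print(pivot, np.round(T, 2), members, deleted)
```

Later clusters would repeat the same steps on the correlation matrix with the members and deleted profiles removed.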

To locate a second cluster, select as a pivot profile the one with
the maximal variance of its column of $r^2$'s in the reduced correlation
matrix. Subsequent steps are the same as previously described. A
new weight vector $w'$ is formed, an $S'$ matrix is computed, and its
elements are divided by the square root of $w_2' R_2 w_2$ to determine
$T_2$. Profiles that correlate at or above $C_1$ are designated cluster
members while those that correlate between $C_1$ and $C_2$ are deleted
from $R_2$ to leave $R_3$.

The process of cluster formation is continued until no cluster can 
be found that is comprised of at least four profiles. At this stage all 
clusters are examined for overlap. If the average correlation between
a member of one cluster and the members of any subsequent
cluster exceeds $C_2^2$, then the member in question is removed from
the cluster. The process of elimination compares members of 
cluster 1 with all other clusters, then members of cluster 2 with all 
others, and so forth. 

The correlation of a person with a factor (factor loading) was 
taken to be comparable to an item-test correlation. The average 
correlation of an item with a test can be shown to be proportional 
to the square root of the average inter-item correlation. It follows 
from this reasoning that the critical points $C_1$ and $C_2$ for the type
factor loadings should be approximately equal to the square roots
of the $C_1$ and $C_2$ values set in Method A.

Application of the weight vector $w_1$ to the correlation matrix
results in 4th powers of the person-person correlations. It serves as
a rotational transformation by increasing the large factor loadings 
and decreasing the small ones for each variable of an unweighted 
factor matrix. This procedure has been suggested by Overall and 
Porterfield (1963). Analytic methods of rotation such as the 
Kaiser varimax also involve maximization of functions involving 
the fourth powers of factor coefficients (Harman, 1960). 

Both methods of cluster analysis differ from conventional factor 
analysis in several respects. The grouping procedure is always applied
to the original correlation coefficients and not to residual
coefficients. Successive clusters are generated from correlation
matrices, each of smaller order than the original matrix. All of the
clusters are based on positively correlated profiles, unlike the bipolar
type factors of obverse factor analysis. Finally, between-cluster
correlations are permitted to be highly negative.

For the data analyzed in the present study, the lower and upper
critical points $C_1$ and $C_2$ were set at .55 and .40 for Method A. For
Method B, the multiple group procedure, $C_1$ and $C_2$ were set at .74
and .63.
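
The relation between the two sets of limits can be checked directly: the Method B values are the square roots of the Method A values, rounded to two places. The short sketch below only reproduces that arithmetic; the dictionary names are hypothetical.

```python
import math

# Critical limits reported for Method A (correlation scale).
method_a_limits = {"C1": 0.55, "C2": 0.40}

# Method B limits (factor-loading scale), taken as square roots of the above.
method_b_limits = {name: round(math.sqrt(value), 2)
                   for name, value in method_a_limits.items()}

print(method_b_limits)
```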


The Measures and the Sample 


The data collection instrument was the Inpatient Multidimensional
Psychiatric Scale (IMPS) (Lorr, Klett, McNair, and
Lasky, 1962). This rating schedule consists of 75 scales and yields
ten relatively independent psychotic syndrome or factor scores.
The schedule has been subjected to repeated factor analyses and
has demonstrated high inter-rater reliability.

The subjects consisted of 150 functional psychotics newly admitted
to psychiatric hospitals. Each patient was interviewed before
drug treatment and rated immediately thereafter on IMPS
by the interviewer and his observer. The ten raw factor scores of 
the two raters were then combined and transformed into standard 
scores. 


Findings 


Application of Method A to the correlations among the 150 
score profiles yielded six subgroups. The six subgroups included 
108 or 72 per cent of the 150 profiles. Application of Method B to 
the same data yielded five subgroups and classified 51 per cent of 
the 150 cases. These differences in number of clusters and per- 
centage classified were probably due to the lack of comparability 
of the limits ($C_1$ and $C_2$) set for the two procedures.

The mean syndrome standard score profiles of the five matched 
subgroups are given in Table 1. As may be seen, by inspection 


TABLE 1 
Mean Syndrome Profiles of Matched Subgroups Identified by Methods A and B
Syndrome 
Type EXCIHOS [РАВ | GRN |PCP|INP | RTD | DIS |MTR|CNP 
Intropunitive — A|— 48 —.60|—*15| —.58 |—.34 1.89) .30 |-.35|—.29|—.51 
ntropunitive ^ B|-.43,—.50|—.59| —.53 | .09 2.13] .81 -.34—28|—94 - 
Excited Al 2.26 1.01—.23| .34 |—.61|-.62| —.68 |—.40| .80| -62 
cited B| 2 53| 15—20] 48 |—.62|—.72| —.81 |—.45) 37.57. 
Disorganized al—.57|—.61|—.64| —.32 |-.31/-.49] 1.59 | 1.50] .56| .83 
isorganized B|—.67|—.77|—.87| —.47 |—.47|—.52| 1.40 | .28 .28|— -08| . 
Hostile Раг. —.67 
ile Par. Al—.39 1.19 .90| —.21 —.10| .08  —.45 |-.34 —.58 —. 
, Hostile Par, AES 1-19 “Fal —38 |-34 -.00| —.44 |—-42|—.65)—.69) - 
PGrendiose Par. АЈ .09| .09| .38| 2.72 | 1.21|-.35| —.32 —.81.—.29|-.14 
randiose Par В| 5б 12 .54| 2.85 | .98|—.97, —.48 I- -3017.351118 


the agreement in profile is excellent. The congruency coefficients in 
the last column of the table support this assertion. 


To what extent do matched subgroups include identical members?
As may be seen in Table 2, in no instance was a profile
classified in two different clusters by the two methods. Method A,
perhaps because of the less restrictive limits, classified more profiles
than Method B.


TABLE 2
Number of Cases Matched in Subgroups Isolated by Two Methods

Method A             Method B Clusters
Clusters      1    2    3    4    5    Unmatched   Total
1            17                           13         30
2                 11                       5         16
3                      17                  4         21
4                           16             6         22
5                                 9        4         13
Unmatched     1    0    3    0    2
Total        18   11   20   16   11

The degree to which each method successfully differentiated the
subgroups may be judged by the average correlations in Tables 3
and 4. The within-cluster correlations are .62 or higher. No cluster
correlates positively with another more than .21. Thus good separation
of the subgroups has been achieved.


TABLE 3
Average Correlations among Subgroups Identified by Method A

Type                          1      2      3      4      5      6
Disorganized            1    .66
Excited                 2   -.30    .77
Intropunitive           3    .05   -.31    .78
Hostile Paranoid        4   -.39    .00   -.07    .67
Grandiose Paranoid      5   -.30    .03   -.23    .02    .73
Anxious-Disorganized    6    .07   -.47    .00    .05   -.05    .74

TABLE 4 
Average Correlations among Subgroups Identified by Method B 
Type 1 2 3 4 5 
Disorganized 1 .73
Excited 2 -.34 .70
Intropunitive 3 -
Hostile Paranoid 4 a M on 


—.41 z = 
Grandiose Paranoid 5 —.33 a d 


In summary it can be said that both cluster methods yield mutually
exclusive subgroups. Five of the subgroups matched agreed
closely with respect to mean syndrome score profile. While no cases
were assigned to different subgroups by the two procedures, the
sizes of the subgroups generated differed. This difference is judged
to be a function of the lack of comparability of the critical limits
set for the two procedures. Finally, the subgroups generated appear
meaningful and are easily interpretable.

SOME PSYCHOMETRIC CHARACTERISTICS OF CUES¹


RALPH F. BERDIE 


Office of the Dean of Students 
University of Minnesota 


Early research on college and university students has been extended
to the environment of these students: the college and university
itself. Information about institutions frequently is obtained
by observing such things as size, type of support, social origins of 
students, financial resources, curricula, and distribution of students 
among curricula. Increasing use, however, is being made of infor- 
mation about the institution gathered from students who have 
expectations and perceptions of its more subtle aspects. 

The College and University Environment Scales (CUES), pub- 
lished by C. Robert Pace in 1963, has been used for this purpose. 

The manual for CUES provides a succinct description. 

“CUES consists of 150 statements about college life—teachers 
and facilities of the campus, rules and regulations, faculty, cur- 
ricula, instruction and examinations, student life, extracurricular 
organizations, and other aspects of the institutional environment 
which help to define the atmosphere or intellectual-social-cultural 
climate of the college as students see it. Students who take the test 
are asked to say whether each statement is generally TRUE or
FALSE with reference to their college: TRUE when they think
the statement is generally characteristic of the college, is a condition
which exists, an event which occurs or might occur, is the way
most people feel or act; and FALSE when they think the statement
is generally not characteristic of the college. The test is,
therefore, a device for obtaining a description of the college from
the students themselves, who presumably know what the environment
is like because they live in it and are a part of it. What the
students are aware of, and agree with some unanimity of impression
to be generally true, defines the prevailing campus atmosphere
as students perceive it.”

¹ This project was conducted with support from the University of Minnesota
Graduate School and the Office of the Dean of Students.

The CUES was derived from the College Characteristics Index
by Pace and Stern. The offspring instrument attempts to assess
only five basic dimensions (listed in Table 1), as compared to the
thirty included in the CCI. It contains only 150 items as compared
to the 300 in the parent instrument, and presumably is a more
efficient instrument, although the theoretical and practical value
of the information provided by the two instruments has not been
compared. CUES is designed for use with students who have attended
college long enough to have experienced its atmosphere
and the statistics in the manual are derived from responses based
on these experiences.


TABLE 1
Comparisons of Two Scoring Methods for CUES for Ten University of Minnesota
Groups, by Sex and College, and 48 College Groups as Reported by Pace

                        Rank Order Correlations
                   University of Minnesota   Pace's Group
Scale                      N = 10               N = 48
1. Practicality             .69                  .95
2. Community                .85                  .88
3. Awareness                .94                  .98
4. Propriety                .65                  .91
5. Scholarship              .89                  .95


The purpose of this report is to summarize some of the psychometric
information obtained as part of a large-scale project involving
the administration of CUES to about 9,000 persons at the
University of Minnesota during the academic year 1964-65. The
major purposes of that research were to study differences in collegiate
expectations and perceptions of students entering varying
colleges in the University and to attempt to identify correlates of
these expectations and perceptions. In addition, changes over time
and correlates of these changes were studied. The results of these
analyses will be reported elsewhere. The data, however, also provided
an opportunity to observe some psychometric characteristics
of the CUES such as reliability, distributions, and effects of differences
in scoring methods.


Procedures 


During the late summer and early fall of 1964, all new entering 
freshmen in the University of Minnesota were asked to complete 
CUES with instructions to respond in terms of their expectations. 
Of the 9,015 entering freshmen in five University colleges, 7,168, 
or 85 per cent, completed the instrument prior to the beginning of 
classes. The parents of some of these students accompanied their
children to the campus for the advanced orientation program and 
137 parents of 99 entering freshmen also completed CUES. At the 
end of the freshman camp which occurs two weekends before the 
beginning of classes in the fall, 702 of the freshmen previously 
tested again completed CUES. The 117 upperclassmen who served 
as camp counselors also took it. In April of 1965, near the close of 
the academic year, 271 of the original group of freshmen again 
completed CUES. This constituted a return of 74 per cent of the 
students asked to retake the test. About 20 additional freshmen 
were added to the retested group in May.

For large groups of entering freshmen other data also were
available: high school percentile rank, score on the Minnesota
Scholastic Aptitude Test, first quarter and first year grade point
averages, parental education, and scores on the Minnesota Counseling
Inventory. The scales of the MCI are labeled: V (Validity),
FR (Family Relations), SR (Social Relations), ES (Emotional
Stability), C (Conformity), R (Reality), M (Mood), and
L (Leadership).


Results 


Comparability of Scoring Methods 


Although two methods of scoring are described in Pace’s manual,
the manual’s discussion of scores is based on only one. The
author of CUES depends mainly on what he calls the “66+”
method of scoring. This method is similar to the method used in
opinion poll analysis and it implies that if two-thirds of the persons
in a group respond in a similar way to an item, that response
characterizes the institution. Scores are not obtained for individuals
but rather the responses of individuals are tabulated and the
number of items in a scale answered in the keyed direction by 66
per cent or more of respondents provides the institution's score on
that scale. These scores in turn provide profiles for institutions
which can be compared to each other. When scores are available
for a number of institutions, the reliability of the scores can be
determined, the scores can be factor analyzed, and correlations
can be obtained between institutional scores and other collegiate
characteristics.

The second method of scoring described by Pace consists of the 
more customary psychometric method whereby the number of 
items on each scale responded to in the keyed direction provides 
the basis for obtaining five scores for each person. When this 
method is used the means and standard deviations for groups 
within a college provide the institutional descriptions. 

The 66+ per cent method is convenient and appropriate when 
attention is centered on the institution rather than on the stu- 
dents within it. The method based on scores of individuals is more 
appropriate when one wishes to study the characteristics of in- 
dividual students related to CUES scores and to changes in CUES 
scores. The report presented here is based on this latter method, 
although a comparison was made between the two methods. 

A random sample of 1,591 persons containing students from
each University of Minnesota college and sex was scored with
both methods. This provided two sets of scores for ten groups: five
of men and five of women. First, each answer sheet was scored to
provide five scores per person and means and standard deviations
were computed for each of the ten groups. Then, for each of the ten
groups an item analysis provided the number of persons in each
group who responded to each item in the keyed direction. Then
for each group on each scale the number of items to which 66 per
cent or more of the students responded in the keyed direction was
counted to provide the “66+” per cent score for that scale. Thus,
for each of the ten groups a mean score on each of the five scales,
based on individual scores, and a “66+” per cent score, based on a
count of the items, was available.
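
The two scoring methods can be contrasted in a few lines of code. The sketch below is illustrative only: it assumes responses have already been coded as True when given in the keyed direction, uses a cutoff of .66 for the “66+” method, and the function names, scale assignments, and toy responses are hypothetical.

```python
import numpy as np

def individual_scores(keyed, scale_items):
    """Per-person score on each scale: number of that scale's items
    answered in the keyed direction."""
    return {scale: keyed[:, idx].sum(axis=1) for scale, idx in scale_items.items()}

def sixty_six_plus_scores(keyed, scale_items, cutoff=0.66):
    """Group score on each scale: number of that scale's items answered
    in the keyed direction by at least `cutoff` of the group."""
    proportion = keyed.mean(axis=0)
    return {scale: int(np.sum(proportion[idx] >= cutoff))
            for scale, idx in scale_items.items()}

# Toy group (hypothetical): 3 persons, 4 items, 2 items per scale.
keyed = np.array([[1, 1, 0, 0],
                  [1, 0, 1, 0],
                  [1, 1, 0, 1]], dtype=bool)
scale_items = {"Practicality": [0, 1], "Community": [2, 3]}

per_person = individual_scores(keyed, scale_items)
group_means = {scale: float(scores.mean()) for scale, scores in per_person.items()}
group_66 = sixty_six_plus_scores(keyed, scale_items)
print(group_means, group_66)
```

The group mean uses every keyed response, whereas the “66+” count discards all information about items endorsed by fewer than two-thirds of the group, which is one reason the two methods need not rank groups identically.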


Table 1 shows the rank order correlations for the two methods
of scoring for each scale. Included in the table are similar figures
from Pace's manual based on 48 schools.

Pace’s rank order correlations suggest that the method of scor- 
ing makes little difference in institutional comparisons. The Uni- 
versity of Minnesota correlations, based on a very small sample, 
suggest that the method of scoring may not provide results quite 
as similar as Pace's results suggest.

One possible explanation of the differences between the Pace re- 
sults and the University of Minnesota results is that the Uni- 
versity of Minnesota scores are based on expectations of students, 
the Pace results on the responses of students having had consider- 
able experience in the institution. 

The Minnesota findings suggest that when results from various 
CUES studies are being compared, special attention should be paid
to the method of scoring used in each study. 

Reliability 

Estimates of reliability were obtained for the Minnesota sam- 
ple in several ways. First, odd-even reliabilities, Spearman-Brown 
corrected, were obtained using a sample of 400 men and women 
entering freshmen in the College of Liberal Arts. Using a similar 
sample, the Kuder-Richardson (formula 21) reliability was determined
separately for each sex. Next, an odd-even S-B corrected
reliability was determined for a random sample of 214 freshmen.
Next, a freshman test-retest reliability was observed, using the 
CUES completed by 695 freshmen who first took CUES late in the 
summer and again took it early in the fall, just prior to the be- 
ginning of classes. Finally, reliability inferences can be based on 
the correlations between CUES completed by freshmen during 
their late summer orientation program and again at the end of the 
academic year. 
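
For readers who wish to reproduce estimates of this kind, the internal-consistency coefficients named above can be sketched as follows. This is a generic illustration of the standard formulas, not the computations used in the study: the Spearman-Brown correction is 2r / (1 + r), Kuder-Richardson formula 21 is taken as (k / (k - 1)) (1 - M(k - M) / (k s²)) with the variance computed with divisor N (an assumption; some texts use N - 1), and the function names and toy data are hypothetical.

```python
import numpy as np

def spearman_brown(r_half):
    """Step a half-test correlation up to full-test length."""
    return 2.0 * r_half / (1.0 + r_half)

def odd_even_reliability(item_scores):
    """Correlate odd-item and even-item half scores, then apply Spearman-Brown.
    item_scores: persons x items array of 0/1 responses."""
    odd = item_scores[:, 0::2].sum(axis=1)    # items 1, 3, 5, ...
    even = item_scores[:, 1::2].sum(axis=1)   # items 2, 4, 6, ...
    r_half = np.corrcoef(odd, even)[0, 1]
    return spearman_brown(r_half)

def kr21(total_scores, k):
    """Kuder-Richardson formula 21 from total scores on a k-item scale."""
    total_scores = np.asarray(total_scores, dtype=float)
    mean = total_scores.mean()
    variance = total_scores.var()             # divisor N (assumed convention)
    return (k / (k - 1.0)) * (1.0 - mean * (k - mean) / (k * variance))

# Toy total scores (hypothetical) on a 30-item scale.
print(round(kr21([10, 20, 30], k=30), 3))
print(round(spearman_brown(0.5), 3))
```

Test-retest coefficients, by contrast, are ordinary correlations between scores from two administrations and need no such correction.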

Again, the results for the University of Minnesota students can- 
not be compared directly with the results on reliability presented 
by Pace insofar as the Minnesota responses are based mainly on 
expectations whereas the Pace responses are based more on experiences.

Table 2 presents the results for the reliability analyses and in- 
cludes a split-half reliability presented by Pace based on insti- 
tutional scores for his 48 colleges. 

Obviously the test-retest correlations reflect both the reliability
of the instrument and the stability of the relative perceptions. 

The correlations presented in Table 2 should be examined in 
light of the means and standard deviations in Table 3. Insofar as 
the sex and college differences in mean scores, although statisti- 
cally significant, were quite small, it appears safe to assume that 
the groups on which the reliability coefficients are based resemble 
those for whom the means and standard deviations were derived. 

The reliability coefficients suggest that CUES scores do not pre- 
sent reliable information about individuals, although the data are 
satisfactory for making group comparisons. The relatively low posi- 
tion of Scale 1, Practicality, using reliability measures based on 
internal consistency, suggests that the structure of Scale 1 may be 
more complex than the structure of the other scales. 


Distribution of Expectancy Scores 


Table 3 presents the means and standard deviations, by college 
and sex, for the entering freshmen instructed to respond to CUES 
in terms of their expectations. Analyses of variance revealed that
on each of the five scales the variance due to college differences
was significant at the .01 level and on Scale 1, Practicality, Scale
3, Awareness, and Scale 5, Scholarship, the variance related to sex
was significant at the .01 level. The entering freshmen of the Uni- 
versity of Minnesota do not have uniform collegiate expectations 
but rather expectations vary in terms of the sex of the student 
and the college he is entering. 


Interrelationships between Expectation Scores 


Inter-correlations between the CUES scores were determined for 
a random sample of 200 men and 200 women entering the Arts 
College. These inter-correlations are presented in Table 4. 

The inter-correlations for men and women show that the scales 
are relatively independent. Whereas Pace found a relatively high 
negative correlation between Scale 3, Awareness, and Scale 1, Prac- 
ticality, the University of Minnesota group did not show this in- 
verse relationship. The same difference is observed between cor- 
relations involving Scale 1, Practicality, and Scale 5, Scholarship. 

TABLE 4

Intercorrelations between CUES Scores for University of Minnesota Entering
Freshmen and for Pace's Norm Sample of 48 Schools

Men (N = 200)
Scale                  2       3       4       5
1. Practicality      .46     .01    -.08     .06
2. Community                 .34     .13     .31
3. Awareness                         .17     .43
4. Propriety                                 .30

Women (N = 200)
Scale                  2       3       4       5
1. Practicality      .55     .25     .09       5
2. Community                 .46     .28     .29
3. Awareness                         .14     .44
4. Propriety                                 .27

Pace's Group (N = 48 Institutions)
Scale                  2       3       4       5
1. Practicality      .28    -.51    -.18    -.58
2. Community                 .10  [illegible] .00
3. Awareness                         .08     .63
4. Propriety                                 .28


In interpreting the inter-correlations one must recall the relatively
low reliabilities of the individual scales. Correcting
inter-correlations for these reliabilities would indicate that the
characteristics assessed are more closely related than the correlations
in the table indicate.
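The correction referred to here is usually Spearman's correction for attenuation, in which an observed correlation is divided by the square root of the product of the two reliabilities. The sketch below is only illustrative: the observed value is the Practicality-Community correlation for men in Table 4, but the two reliabilities are hypothetical placeholders rather than the coefficients reported for CUES, and the function name is an assumption of this example.

```python
import math


def correct_for_attenuation(r_xy: float, rel_x: float, rel_y: float) -> float:
    """Spearman's correction: observed r divided by sqrt(rel_x * rel_y)."""
    if not (0 < rel_x <= 1 and 0 < rel_y <= 1):
        raise ValueError("reliabilities must lie in (0, 1]")
    return r_xy / math.sqrt(rel_x * rel_y)


# Observed correlation between Practicality and Community, men (Table 4).
observed_r = 0.46

# HYPOTHETICAL reliabilities, chosen only to illustrate the arithmetic.
rel_practicality = 0.60
rel_community = 0.65

corrected_r = correct_for_attenuation(observed_r, rel_practicality, rel_community)
print(f"observed r = {observed_r:.2f}, corrected r = {corrected_r:.2f}")
```

Because both reliabilities are below 1, the corrected value is necessarily larger in absolute size than the observed one, which is the point made in the text.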


Correlates of CUES Scores 


Analyses of variance on Scale 1 considering college differences, 
sex differences, mother's education, and grade point average at the 
end of the fall quarter revealed significant variance related only to 
college and sex and to the interaction between grade point average, 
college, and mother's education (significant at the .05 level). On 
Scale 2 the only significant variance was related to college. No 
source of variance other than college and sex provided significant 
variance on Scale 3. On Scale 4 the interaction between grade 
point average, sex, college, and mother's education, was significant 
at the .01 level. On Scale 5, the interactions between college and
mother's education and college and mother's education and sex
were significant in addition to the variance related to college and
sex.

These analyses suggest that if a relationship can be observed
between maternal education and CUES and grade point average 
and CUES, these relationships are quite complex and apparently 
not consistent from scale to scale. 

Table 5 presents the correlations between the five CUES scores 
and high school percentile rank, score on the Minnesota Scholastic 
Aptitude Test, and fall quarter grade point average, for samples 
of students in the College of Liberal Arts. 


TABLE 5
Correlations between CUES and High School Percentile Rank, Minnesota Scholastic
Aptitude Test, and Fall Quarter Grade Point Average for
University of Minnesota Freshmen

Men
                     HS Rank       MSAT        GPA
Scale                N = 187     N = 186     N = 191
1. Practicality        .04        -.14        -.12
2. Community          -.02        -.08        -.09
3. Awareness           .10         .19**       .15
4. Propriety           .14         .01         .11
5. Scholarship         .14         .03         .07

Women
                     HS Rank       MSAT        GPA
Scale                N = 186     N = 190     N = 195
1. Practicality       -.17*       -.24**      -.18*
2. Community          -.01        -.13        -.10
3. Awareness          -.02        -.02        -.10
4. Propriety          -.07        -.10        -.05
5. Scholarship         .15        -.01        -.01

* Significant at .05 level.
** Significant at .01 level.


All of the correlations are low. Among the men, only one, that of
.19 between Awareness and College Aptitude Test score, is statistically
significant. Among the women, three correlations are statistically
significant, those between Practicality and the three
academic indices. These three are all negative and low and suggest
a slight inverse relationship between attitudes related to practicality
and academic performance. In general, no substantial relationships
were found between expectations of the University and
scholastic indices. Similar correlations for groups in each of the
colleges supported this generalization.
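The significance flags in Table 5 can be checked approximately from r and N alone. The sketch below uses the Fisher z transformation with a normal approximation; this is an assumption of the example, since the source does not say which test was applied, and the function name is hypothetical. The (r, N) pairs are taken from Table 5.

```python
import math


def fisher_z_p_value(r: float, n: int) -> float:
    """Two-tailed p value for H0: rho = 0, via Fisher's z (normal approximation)."""
    z = math.atanh(r) * math.sqrt(n - 3)
    # erfc(|z| / sqrt(2)) is the two-tailed area under the standard normal curve.
    return math.erfc(abs(z) / math.sqrt(2))


# (label, r, N) as read from Table 5.
table5_examples = [
    ("Men: Awareness vs. MSAT", 0.19, 186),
    ("Men: Practicality vs. MSAT", -0.14, 186),
    ("Women: Practicality vs. HS rank", -0.17, 186),
    ("Women: Practicality vs. MSAT", -0.24, 190),
]

for label, r, n in table5_examples:
    print(f"{label}: r = {r:+.2f}, N = {n}, p = {fisher_z_p_value(r, n):.4f}")
```

Under this approximation the .19 for men falls below the .01 level while the -.14 does not reach the .05 level, in line with the asterisks in the table.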

For freshmen for whom CUES scores and scores on the Minnesota
Counseling Inventory were available, correlations were computed
between the two instruments and these correlations are presented
in Table 6.


TABLE 6

Correlations between CUES and Minnesota Counseling Inventory for
University of Minnesota Freshmen

Men (N = 135)

                                 CUES Scales
MCI      1. Prac-   2. Com-   3. Aware-   4. Pro-   5. Scho-
Scale    ticality   munity    ness        priety    larship     Mean      SD
M          .05       -.05      -.07         .14       .07        3.48    1.98
FR        -.12       -.07      -.08        -.22*      .00        7.83    6.09
SR        -.21*      -.09      -.16        -.01      -.15       19.93   12.26
ES        -.07        .00       .07        -.11      -.01       11.50    6.15
C          .02        .03       .00        -.13      -.01       10.97    3.01
R         -.12        .00       .04        -.24*     -.03        8.29    6.19
M         -.18       -.07      -.05         .04      -.10       11.38    3.93
L         -.26*      -.14      -.21*       -.08      -.20*      11.4     4.55
Mean     17.16      19.16     24.39       15.24     24.17
SD        3.00         85      3.03        4.02      3.16
Women (N = 85) 
Among the men the highest correlation was -.26 and among
the women, —.29. For the men, six of the forty correlations were 
statistically significant at least at the .05 level but only two of the 
forty for the women. (High scores on the MCI reflect less social 
skill, greater maladjustment, less leadership.) 
The results in Table 6 suggest that expectations of the University
are relatively independent of personality characteristics measured
by the Minnesota Counseling Inventory.

Finally, the correlations were computed between CUES scores of
students and parents. Although the shapes of the profile for parents
and students were similar, and the means quite alike, only one of
the child-parent correlations was significant. That was a correlation
of .26 on Scale 1, Practicality, for students and mothers. These
correlations and means suggest that the expectations students and
parents have of college reflect the attitudes prevailing in the wider 
community but that the similarity between a child's expectations 
and those of his parents is no greater than the similarity between 
the expectations of that child and those of other similar adults 
from the community. Perhaps the attitudes of high school gradu- 
ates toward college are influenced at least as much by the attitudes 
they encounter outside of the home as they are by the attitudes
expressed by their parents.


Conclusion 


The analyses of CUES reported here suggest that interpretations 
of results reported in various studies must depend at least in part 
on the method of scoring used. The reliabilities of the CUES scales
based on expectations appear quite adequate for purposes of group
comparison, but they are not sufficient to allow one to
make inferences regarding the perceptions and expectations of
individual students. Descriptive statistics based on responses
expressing expectation are presented here to provide information
about sub-groups in one complex institution. A university such as
Minnesota is not homogeneous in terms of expectations of students
or perceptions of students and faculty.

A study of relationships between CUES scores and other variables
suggests that responses to CUES are not much related to
such things as high school percentile rank, ability test scores,
college achievement, and scores on personality inventories. If there
is a relationship between the college expectations of students and
the attitudes of their parents, this relationship most likely is not
large and perhaps is quite complex.


THE PENSACOLA Z SCALE IN VOLUNTEER GROUPS


MARSHALL B. JONES 
University of Florida 


The Pensacola Z Scale was constructed in the mid 1950's, when
interest in authoritarianism and the California F Scale was most 
intense. The Z Scale is a 66-item forced-choice questionnaire de- 
signed to measure “personal autonomy.” 

The F Scale is intended to measure the reverse of autonomy, i.e.,
authoritarianism. Its approach is through political and social
attitudes of a conformist stamp (Adorno et al., 1950). Some items
express specifically conservative attitudes, for example,

41. The businessman and the manufacturer are much more
important to society than the artist and the professor.

Others, like

23. What this country needs most, more than laws and political
programs, is a few courageous, tireless, devoted leaders in whom
the people can put their faith,

are more general in their terms. Still, the reactionary sentiment
is plain.

The Z Scale was built against the F Scale as criterion (Jones, 
1957). In the naval aviation cadet population, the two scales
correlated -.43.¹ The Z Scale consists of standard personality items,
for example,

11. (A) You are sexually appealing.
(B) You are faithful.

29. (A) You are forgetful.
(B) You have a meticulous memory.


1 In this review the Z Scale is treated as keyed for personal autonomy and
against authoritarianism. In some of the earlier reports (Jones, 1956, 1957;
Kaess and Witryol, 1957) the scale was keyed for authoritarianism.

The items of the Z Scale are less attitudinal than they are descriptive
of the subject's behavior, values, and self-concept. A
subject who scores high on the Z Scale describes himself as
self-confident, independent, and friendly, with an inquiring,
risk-taking habit of mind.

The F Scale has been extensively criticized because all affirma- 
tive answers are keyed for authoritarianism (Bass, 1955). The 
tendency to answer items positively or negatively, regardless of
their content, is an established response set (Cronbach, 1946). In
the F Scale this bias is completely confounded with the key. The
more “acquiescent” a subject is, i.e., the more he tends to answer
any item in the affirmative, the higher his score on the F Scale.
Theoretically, acquiescence and authoritarianism ought to be positively
related. Nevertheless, they are not the same thing.

The Z Scale is free of this difficulty. First, the use of the forced-choice
form rather than single-sentence items greatly reduces the
role of acquiescence. Secondly, in 36 of the 66 items the autonomous
response is the first item in the pair, while in the remaining
30 items it is the second.

In the eight years since its publication, the Z Scale has been 
used in a variety of volunteer groups. Mean performance in these 
groups varies over an extensive range and in a patterned manner. 


Results and Discussion 


In Table 1 various group results for the Z Scale are presented.
The “astronauts” were 26 of the 31 men who were administered 
psychological tests in the selection program for Project Mercury. 
The “antarctic scientists” 
Antarctica during the International Geophysical Year, 1957-1958. 

The “college students” were 182 men and 166 women in sophomore
psychology courses at the University of Florida.² The
“cadets” were under training at Pensacola, Florida, to become
[...] a group of 766 were tested in 1956 and a second
[...] “enlisted men” were 761 men undergoing
training for submarine duty at the U. S. Naval Submarine
Base, New London, Connecticut. And the "retrainees" were 407
prisoners at the U. S. Naval Retraining Command, Portsmouth,
New Hampshire.

2 The author is indebted to Dr. Henry M. M. [...] data involving Florida undergraduates.

The astronauts and the Navy groups, except for the second cadet 
sample, were tested in 1955 and 1956; the antarctic scientists 
were tested two years later; and the college students were tested 
in 1964. These differences in time of testing do not seem to present
a problem, since the results in the two samples of cadets, nine 
years separated in time, are virtually identical. 

The Z Scale was built on an all-male population and until 
recently had not been administered to women. The results of the 
college students, however, make it clear that there is no sex differ- 
ence on the test. 

In the cadet population, the Z Scale correlates in the low .20's 
with the Psychological Examination of the American Council on 
Education (ACE). This result (Jones, 1957, p. 7) with a standard 
test of intelligence concords with the group results in Table 1. 
The astronauts had an average IQ of 133 on the Wechsler-Bellevue 
(Smith and Jones, 1962, p. 164). The cadets averaged an IQ of 
110 on the ACE. No intelligence-test results were available on the 
antarctic scientists or the college students. However, as a graduate-level
group, the antarctic scientists probably had an IQ level
close to that of the astronauts; and the average IQ of college
sophomores generally falls between the cadet and astronaut
values.

General Classification Test (GCT) scores averaged 57.2 for the 
enlisted men and 44.4 for the retrainees (Jones, 1956). The GCT 
is standardized in the enlisted population with a mean of 50 and a 
standard deviation of 10. Over all, therefore, there is a clear posi- 
tive association between intelligence-test performance and Z score. 

Educational level follows the same course. The enlisted men and
retrainees are both high school groups, but educational level is
distinctly lower among the prisoners. The cadets are selected from
the college population. They must have at least two years of college;
but most of them have not completed the bachelor's degree,
frequently because of poor grades. (In the 1954 sample, many
cadets had lost their student status with the draft because of
poor grades.) Both the antarctic scientists and the astronauts
were at the graduate level. An advanced degree was required for
the astronaut program, and almost all of the men in the antarctic
sample either had an advanced degree or were working toward one.

3 There is evidence that college sophomores generally may have an average
on the Z Scale which, though still falling between the cadet and astronaut
levels, is lower than the sophomore average at Florida. Kaess and Witryol
(1957), after deleting two items from the Z Scale (Items 2 and 18) because
of their sexual content, obtained averages of 30.3 and 29.1 among 216 men and
306 women from introductory psychology classes at the University of
Connecticut. In the cadet population the two missing items averaged an autonomy
score of 1.3. If this amount is added to the averages in the Kaess and
Witryol samples, the Connecticut average for both sexes would be only
slightly higher than the cadet averages.

In some comparisons, intelligence-test performance is sufficient 
to account for the difference in mean Z-score. When covariance 
adjustments are made for the GCT, the difference between the enlisted
men and the retrainees is no longer significant. In other
comparisons, however, test performance cannot account for the 
difference. For example, the difference between the college students 
and the cadets is much too large to be explained in terms of IQ dif- 
ferences and the correlation between IQ and Z. 

In the case of the cadet samples, the low levels of autonomy 
may be partly attributable to faking. When 311 cadets were asked
“to beat the test,” to put down what they thought was the “best”
answer, regardless of its application to them, their average was
0.8 points lower than under the standard instruction;⁴ the difference
was not significant. However, even if full weight were given to
the drop, only a small part of the difference between the cadets 
and the college students could be attributed to faking. 

The relationship to volunteering is ambiguous in Table 1. The 
two distinctly volunteer groups, the astronauts and the antarctic
scientists, have high averages, but it is not certain that their levels
are higher than those of graduate-level people generally.

Table 2 contains Z-Scale results in five groups of Peace Corps
volunteers.⁵ The data are broken down by sex; no appreciable difference
between the sexes appears in any of the groups or in the
totals.

The means in Table 2 are uniformly high. The average of all 
five Peace Corps groups is half a standard deviation higher than 


4 This result was confirmed by Kaess and Witryol (1957) with college [...]

5 The author thanks Drs. Charlotte Wh[...] Jourard for permission to publish these [...]
TABLE 1
Means and Standard Deviations of the Pensacola Z Scale in Various Groups

Group                          N     Mean     SD
Astronauts                    26     36.8    3.6
Antarctic scientists          56     36.4    7.5
College students, male       182     34.1    7.0
College students, female     166     34.0    6.8
Naval air cadets (1956)      766     30.5    6.3
Naval air cadets (1964)      323     30.3    6.0
Enlisted men                 761     29.4    5.3
Retrainees                   407     27.2    5.7


TABLE 2
Means and Standard Deviations of the Pensacola Z Scale in
Five Peace Corps Groups

                           Male               Female              Total
Project                N   Mean    SD      N   Mean    SD      N   Mean    SD
Teaching              25   41.1   6.8     25   41.6   6.4     50   41.3   6.6
Art                    7   44.6   7.9     12   43.5   6.6     19   43.9   7.1
Urban action          28   37.1   6.1     10   34.6   7.3     38   36.6   7.7
Agricultural dev.     25   35.7   5.6     15   36.7   6.8     40   36.0   6.0
Community dev.        29   42.0   5.5     19   40.3   5.6     48   41.3   5.6
Total                114   39.4   6.1     81   39.8   6.5    195   39.0   6.5


their nearest competitors, the astronauts and antarctic scientists.
It is two standard deviations higher than the Portsmouth retrainees.
Some of the Peace Corps groups, the art project particularly,
are very high.

In the main, all of the Peace Corps groups were composed of 
college-aged people. In educational level they were, perhaps, two 
years more advanced than the college sophomores and two years 
less advanced than the astronauts and antarctic scientists. Never- 
theless, the Peace Corps averages range from almost one to a full 
standard deviation and a half higher than the college-sophomore 
average. Neither IQ nor educational level can account for eleva- 
tions of these magnitudes. 

The Peace Corps groups, except for the agricultural project, had 
majored principally in humanities. In this respect, they differed 
from either the astronauts or the antarctic scientists, both of which 
were physical-science groups. It seems likely, therefore, that com- 
mitment to the humanities accounts in part for the high averages 
of Table 2. It is also possible that the volunteer character of the 
Peace Corps, particularly in combination with its humanistic
idealism, may play a role. Still, unambiguous evidence that volun- 
teering per se contributes to a high Z score is missing. 

In Table 3 the results of a study on volunteering among medical 


TABLE 3
Means and Standard Deviations of the Pensacola Z Scale among Volunteer
and Nonvolunteer Medical Students

Group              N     Mean     SD
Volunteers        15     39.8    9.0
Non-volunteers    40     35.0    6.9
Total             55     36.3    7.8


students at the University of Florida College of Medicine are presented.⁶
An entire class of students was invited to participate in a
study on the effects of LSD. Of the 55 students in the class 15 volunteered.
All of the students were given the Z Scale. As is clear
from Table 3, the 15 volunteers averaged almost five points higher
than the nonvolunteers. The result is significant beyond the .05
level.
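The comparison of volunteers and nonvolunteers can be re-derived from the summary figures in Table 3. The sketch below assumes a pooled-variance two-sample t test, which the source does not name, and the function name is hypothetical; only the means, standard deviations, and group sizes come from the table.

```python
import math


def pooled_t(mean1, sd1, n1, mean2, sd2, n2):
    """Two-sample t statistic with pooled variance, computed from summary statistics."""
    df = n1 + n2 - 2
    pooled_var = ((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2) / df
    standard_error = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    return (mean1 - mean2) / standard_error, df


# Volunteers vs. nonvolunteers, values from Table 3.
t_stat, df = pooled_t(39.8, 9.0, 15, 35.0, 6.9, 40)
print(f"t = {t_stat:.2f} on {df} degrees of freedom")
```

Under this assumption t comes out at about 2.1 on 53 degrees of freedom, which exceeds the two-tailed .05 critical value of roughly 2.01 and so agrees with the significance level reported in the text.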

The mean for the medical students as a whole is roughly the
same as for the astronauts and antarctic scientists. This result
might be taken to mean that the means for all three groups reflect
educational level and IQ exclusively. There are several objections
to this interpretation. First, all three groups are highly self-selected.
Second, the astronauts, for example, are also a military
group, older, married for the most part, and oriented around the
physical sciences. The results of this review by no means exclude
the possibility that a military group of comparable age and marital
status but uninterested in the exploration of space or any comparable
undertaking might have lower Z scores.

Kaess and Witryol (1957) found significantly positive correlations
between the Z Scale and the Ascendance, Emotional Stability,
Objectivity, Masculinity, and Friendliness Scales of the
Guilford-Zimmerman Temperament Survey. On the Allport-Vernon


correlation with Economic. 


6 The author acknowledges with thanks Miss Mary Ann Cromer for her
permission to make use of these findings.

These results, like those of this review, support the construct
validity of the Z Scale. Personal autonomy presupposes the abili- 
ties to cope with reality as an individual, from which the ob- 
served relationships to IQ and educational level follow. The auton- 
omous inclination toward humanities and the "softer" disciplines 
and its implied interest in other people is antithetical to authori- 
tarian self-rejection and suspicion. In much the same way, the 
willingness of high-scoring subjects to volunteer, their adventur- 
ing qualities, are an acting-out of their own self-assertion. 


ATTITUDE CHANGE MEASUREMENT IN ISOLATED
WORK GROUPS¹,²


LOYDA M. SHEARS 
California State Department of Mental Hygiene 
Pacific State Hospital 
Pomona, California 


Using data gathered in isolated South Pole scientific stations,
attitude items in The Opinion Survey (Gunderson and Nelson,
1962) were evaluated for their ability to record changes in the
way Navy men and civilian scientists saw themselves and their
group over time. Attention was focused on the items which recorded
changes over time and discriminated among subjects on two
occasions but which ranked the subjects differently on the two
testing occasions. A set of eight items emerged for which it
was possible to identify a consistent set of responses within sub- 
jects but non-consistent responses across subjects. This internally 
consistent set of items is presented as representing an attitude- 
change factor within the attitudes of these isolated living-working 
group members. 


Introduction 


The Opinion Survey was developed by Gunderson and Nelson
(1962) for use in the 1962 scientific expedition to the South Pole. 


1 Data were gathered while the author was a staff member of the U. S. Navy
Medical Neuropsychiatric Research Unit, San Diego, California, under Bureau 
of Medicine and Surgery Research Task MR005, 12-2004, Subtask 1. Opinions 
or assertions contained herein are the private ones of the author and are not 
to be construed as official or as necessarily reflecting the views of the Depart- 
ment of the Navy or of the naval service at large. Dr. Edward Alf's critical 
comments have been most beneficial in the preparation of this paper. The 
assistance of Mr. Frank Thompson in the data processing is gratefully 
acknowledged. 

2 Supported in part by the National Institute of Mental Health Grant No. 
It was based in part on the earlier inventories devised by Zimmer
(1958) for use during the International Geophysical Year (1957 
and 1958 expeditions) to study attitudes and group behavior. 

The small-station settings where The Opinion Survey was used 
to gather responses from Navy men and civilian scientists were 
extremely isolated in that weather conditions enforced complete 
physical isolation for the four to six months of winter darkness. 
Since The Opinion Survey was administered twice, soon after summer
support contingents had departed and again just prior to the
arrival of the relief contingents, the items of this instrument 
could be assessed for their capacity for registering attitude changes 
in these isolated living-working settings over time. It is the pur- 
pose of this study to identify a set of such attitude-change items. 

The Opinion Survey was composed of 67 items, 40 of which 
loaded on the factors identified in earlier analyses of these data 
(Shears and Gunderson, 1966). Item responses were sought on a
six-point scale stated in terms of extent of agreement or disagreement
with statements describing the respondent's attitudes or his
perceptions of conditions in the group situation at the time of
administration. The three small stations were manned by 22, 24,
and 38 men, respectively, and matched responses were available 
for 63 persons covering two occasions. 

In order to identify the items sensitive to change over time and
to differential change-directions among subjects, three categories
of items are described:

1. Those items which elicited stable responses over time, dis- 
criminated among subjects on both occasions, and recorded 
little change from the first to the later occasion; 

2. Those items which recorded changes over time, discriminated 
among subjects on both occasions, and ranged them in simi- 
lar rank order on both occasions; and 

3. Those items which recorded changes over time, discriminated 
among subjects on both occasions, and ranged them in dif- 
ferent rank order on the two occasions. 

Only the items in the third category are of interest in this study. 

Such items would be the “change items" described by Bereiter 
(1962) in that the location of the subject on the pre-test for any 
single item is a poor predictor of his location on the post-test
for any single item. The difficulty which Bereiter refers to in his
discussion of “what has changed" may be more easily resolved on a
rational basis for these items than for the usual subjective attitude
items; one may suspect that the social situation changed since 
there was evidence that persons in similar settings exhibited signs 
of emotional stress during the interval of isolation (Gunderson, 
1963). Thus, one may speculate that reported changes in attitude 
items are reflecting genuine alterations in the social climate of the 
group. Hopefully, both the means of assessing changes in group 
climate and a description of dimensions of the change will be 
available when the third category of items described above is
identified.


Method 


In order to identify the three categories of items mentioned 
above, it was necessary to describe item responses given on each 
of the two occasions as part of a single response-continuum. The
median of the combined response distribution was established for
each item and four-fold tables were constructed representing the
frequencies and percentages of cases above the median (+) for one
or both of the occasions; similarly, the categorization was made
for the cases below the median (-). Three statistical evaluations
for each item were carried out on the basis of these four-fold
tables. First, the Median Test was used to evaluate the similarity
of subjects’ responses to the item (Siegel, 1956). Then, to establish 
dissimilarity of subjects’ responses on the two occasions, the 
McNemar Chi-Square-Test-for-Change was used (Siegel, 1956). 
Finally, the extent of correlation between individual responses on 
the two occasions across the whole group was evaluated using an 
estimated tetrachoric correlation coefficient. 
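Two of the three item statistics can be sketched directly from a four-fold table. The code below is illustrative only: the cell counts are hypothetical (they merely sum to the 63 matched subjects), the McNemar statistic uses the continuity-corrected form, and the tetrachoric coefficient is estimated with the cosine-pi approximation, which is an assumption because the source does not say which estimate was used. Function names are likewise hypothetical.

```python
import math


def mcnemar_chi_square(b: int, c: int) -> float:
    """Continuity-corrected McNemar chi-square (1 df) from the two 'changed' cells."""
    if b + c == 0:
        return 0.0
    return (abs(b - c) - 1) ** 2 / (b + c)


def tetrachoric_cos_pi(a: int, b: int, c: int, d: int) -> float:
    """Cosine-pi approximation to the tetrachoric correlation of a 2 x 2 table."""
    if a * d == 0 and b * c == 0:
        raise ValueError("coefficient is undefined for this table")
    if b * c == 0:
        return 1.0
    return math.cos(math.pi / (1 + math.sqrt((a * d) / (b * c))))


# HYPOTHETICAL counts for one item (first occasion, second occasion):
a = 16  # above the median on both occasions (+, +)
b = 22  # above on the first, below on the second (+, -)
c = 7   # below on the first, above on the second (-, +)
d = 18  # below the median on both occasions (-, -)

chi2 = mcnemar_chi_square(b, c)
r_tet = tetrachoric_cos_pi(a, b, c, d)
print(f"McNemar chi-square = {chi2:.2f}, estimated tetrachoric r = {r_tet:.2f}")
```

With these made-up counts the change statistic exceeds 3.84, the .05 critical value of chi-square with one degree of freedom, while the estimated correlation stays below .35, which is the pattern the selection rule looks for.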

The items which were characterized by non-significant Median 
Test evaluations, significant Change Test evaluations and low
tetrachoric correlation coefficients (below .35) were further evaluated
for internal consistency of subject-responses in the item-set as
a whole. The status of each subject was described as “Improved”
(+), “Unchanged” (0), or “Deteriorated” (-), with respect to
each item. Correlations were computed among items so categorized 
across all subjects (Bereiter, 1963). 
Results


Eight items emerged from these statistical analyses with the 
characteristics required for inclusion in a Change Scale. The text
of each item, together with the proportions of subjects whose responses
were above and below the median of the combined distribution
for the specific item, is presented in Table 1. In each case
the Median Test values were non-significant, indicating that the 
items did not measure similarities in the subjects’ responses for 
the two occasions. The Chi-Square-for-Change values were sig- 
nificant at the .05 level of confidence or less in each case, indicating 
that differences in the subjects’ responses for the two occasions 
were being measured. The low tetrachoric correlation coefficients 
(not greater than .35) indicated that early status was minimally 
predictive of later status for every item included in the item-set. 
The objectives of identifying items which record changes over 
time and are sensitive to individual differences in the direction of 
this changed response-behavior have apparently been achieved. 

Success in the achievement of the objective of devising a measure
which could discriminate among subjects for their responses on
both occasions, taken together, remains to be evaluated. Inspec- 
tion of the item intercorrelation matrix presented in Table 2 yields 
some evidence on this point. These correlations are based on the 
categorization of subjects which described their status on the first 
occasion with respect to their status on the later occasion (as 
described under Method), and may be thought of as indicative of 
the presence of a Change Factor measurable in terms of the item- 
set. The correlations are relatively high, but not so high as to in- 
dicate that all subjects responded identically to all of the items in 
the set. Thus, frequency of changed response status for the set as a
whole may be thought of as more discriminating of “change” than
any item taken alone. An estimate of reliability for the item-set 
as a whole equal to .73 was computed using the average item inter- 
correlation (Guilford, 1950). Responses to the item-set may be 
said to be reasonably internally consistent. 
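The reported estimate can be reproduced from the item intercorrelations in Table 2 if one assumes that the Spearman-Brown formula was applied to the average inter-item correlation; the source cites Guilford (1950) without spelling out the formula, so this is a sketch under that assumption rather than a statement of the author's exact procedure.

```python
# Upper triangle of Table 2 (decimal points omitted in the table), items 1-8.
upper_triangle = [
    [43, 43, 31, 35, 26, 23, 25],  # item 1 with items 2-8
    [22, 38, 8, 30, 26, 31],       # item 2 with items 3-8
    [32, 42, 36, 19, 16],          # item 3 with items 4-8
    [26, 34, 13, 2],               # item 4 with items 5-8
    [17, 26, 18],                  # item 5 with items 6-8
    [8, 3],                        # item 6 with items 7-8
    [23],                          # item 7 with item 8
]

correlations = [value / 100 for row in upper_triangle for value in row]
n_items = len(upper_triangle) + 1
mean_r = sum(correlations) / len(correlations)

# Spearman-Brown estimate of the reliability of an n_items-long composite.
reliability = n_items * mean_r / (1 + (n_items - 1) * mean_r)
print(f"mean inter-item r = {mean_r:.3f}, estimated reliability = {reliability:.2f}")
```

The 28 coefficients average about .25, and the resulting estimate rounds to .73, matching the value given in the text.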


Discussion 


A discussion of the meaning of an estimate of reliability based 
on internal consistency may be a convenient way to explicate the 
TABLE 2
Intercorrelations among Change Scale Items

Item No.     1     2     3     4     5     6     7     8
1          100    43    43    31    35    26    23    25
2                100    22    38    08    30    26    31
3                      100    32    42    36    19    16
4                            100    26    34    13    02
5                                  100    17    26    18
6                                        100    08    03
7                                              100    23
8                                                    100

Note. Decimal points are omitted.


application of this measure. In the study of individuals, if subjects 
who are categorized as “Improved” with respect to one item fall 
into the same category for other items, the item-set is reliable in its 
identification of an “Improved” case. Once the internal consistency 
estimate has been established, as it seems to be for these items, it is 
possible to turn the statement around and say that the subject is 
more or less surely identified as an “Improved” case. A similar
identification may be made for “Deteriorated” cases. The extent 
to which the individual did, indeed, respond homogeneously will 
constitute an algebraic sum of items marked “Deteriorated” or 
“Improved,” and his score will locate him on a continuum of 
change-in-perception-of-group-climate. A set of scores for subjects 
so located may, then, be used as a criterion against which measures 
to predict the phenomenon measured by the Change Scale may be 
validated. Likewise, the Change Scale scores may be used to iden- 
tify related ratings and other individual performance indices gen- 
erated within the group setting. 
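The scoring rule described in this paragraph amounts to an algebraic sum over the eight items. A minimal sketch follows, assuming weights of +1, 0, and -1 for the three statuses; the weights, the function name, and the example response pattern are illustrative assumptions, not values taken from the study.

```python
# Assumed weights for the three change statuses described under Method.
STATUS_WEIGHTS = {"Improved": 1, "Unchanged": 0, "Deteriorated": -1}


def change_scale_score(statuses):
    """Algebraic sum of item statuses for one subject across the Change Scale items."""
    return sum(STATUS_WEIGHTS[status] for status in statuses)


# HYPOTHETICAL subject: status on each of the eight Change Scale items.
example_subject = [
    "Improved", "Improved", "Unchanged", "Deteriorated",
    "Improved", "Unchanged", "Improved", "Improved",
]

score = change_scale_score(example_subject)
print(f"Change Scale score = {score}")
```

With eight items the score can range from -8 (deteriorated on every item) to +8 (improved on every item), which gives the continuum of change in perceived group climate referred to in the text.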

Since items which range the subjects in the same rank order on 
different occasions while measuring group change have been excluded
from the Change Scale, rendering it resistant to the effects
of “Normal deterioration,” it is anticipated that, when a large
proportion of group members see the group as “Deteriorated,” one 
may say that the group climate has actually deteriorated. Relative 
values for group (Contingent) means in Change Scale Scores 
would become more discriminating criterion scores for evaluating 
deterioration in isolated groups (Contingents) than means based 
on attitude scores reflecting “Normal deterioration.” Hopefully, a
Deterioration Scale for groups would emerge from this application
of the Change Scale that could be used as a yardstick for measuring
the capacity of groups to carry on their assigned missions.

The content of the Change Scale items seems to deal with the 
status of group life as observed by the subjects. Qualitative evaluations 
of non-specific aspects of outcome-expectations for interpersonal 
relationships appear to be important content elements in this 
change-sensitive instrument. There was some limited reference to 
specific activities (such as “bull sessions”) as well. It is not surpris- 
ing that responses to such items would change since Gunderson 
(1963) found identifiable physical signs of emotional stress in men 
living and working in similar isolated circumstances. Such physical 
signs (described as common complaints: nausea, heart or chest 
pains, insomnia, and the like) might be expected to have behavioral 
concomitants, and there is evidence (Seaton, 1962) that 
these concomitants are likely to appear in the absence of changes 
in more stable interpersonal measures of group structure, i.e., sociometric 
choice, among groups of men undergoing the stress of an 
Arctic training exercise. Both of these researchers reported a generalized 
deterioration of group status over time, and Seaton suggested 
that if “inadequate resources for social support and control 
...lead to breakdown in groups under stress, then men less need- 
ful of social support and control should fare better in such groups.” 
Since the present Change Scale has been developed in such a way 
as to remove items which were sensitive to a generalized tendency 
to deteriorate, the content of the items should reflect the individ- 
ual’s perception of social support in the milieu or the lack of it. 
Presumably, the men most in need of social support would report 
a deficit in this area, while those whose need for support was less 
would less readily report a deficit. 


DEMAND CHARACTERISTICS ASSOCIATED WITH 
SEMANTIC DIFFERENTIAL RATINGS OF 
NOUNS AND VERBS¹ 


JERRY F. CATALDO, IRWIN SILVERMAN, AND JEROME M. BROWN 
State University of New York at Buffalo 


Recent attention has been focussed on motives which are spe- 
cific to the role of the Subject (S) in a psychological experiment, 
and the ways in which behavior related to these motives confound 
the observations made in psychological research. Orne (1962) has 
proposed that Ss are motivated to perform in the manner that they 
perceive the Experimenter's (E) hypothesis requires that they per- 
form, and that they are attuned to cues within the experimental 
situation which will suggest to them the desired mode of response. 
These cues, as they are perceived by the S, are termed “demand 
characteristics.” When demand characteristics are congruent with the 
E’s hypothesis, spuriously positive results may be obtained, and 
on this basis Orne’s formulations have led to the reevaluation of 
existing data in several areas (Orne, 1959; Ward and Sandvold, 
1963; Orne and Scheibe, 1964). 

The present study is part of an ongoing program conducted by 
the second author which is directed, in part, to a further explora- 
tion of the areas of research in which consideration of demand 
characteristics may provide alternate explanations of the data 
obtained. This report describes two replications of one such study. 
In one, modifications were made in the design which, it was assumed, 
prevented the Ss from responding to demand characteristics. 
The study in question (Livant, 1963) was based on the contention 
that grammatical forms convey connotative meaning. Nine 
English words, which could function as nouns or verbs, and 
nonsense syllables alleged to be foreign language words, were presented 
to Ss to be rated on nine scales of the Semantic Differential, 
which comprised the three scales most highly loaded on each of the 
evaluative, activity, and potency dimensions (Osgood, Suci, and 
Tannenbaum, 1957). Ss were instructed to rate each word twice, 
thinking of it as a noun and, immediately after, as a verb. In 
addition, ratings were made for the words “noun” and “verb.” It 
was hypothesized that nouns would be rated as more potent and 
less active than verbs, and that there would be no difference on the 
evaluative scales. The latter two predictions were considered substantiated 
by the findings. The contention explored in the present 
study was that these results reflected accurate assumptions regarding 
the E’s expectations on the part of Ss, and their tendencies to 
comply with these. 


1 This study was supported by Grant No. GS 1023 from the National Science 
Foundation, for which the second author is principal investigator. 


Method 


Livant described his sample as 21 “college students.” Each replication 
of this study employed 21 Introductory Psychology students, 
with sexes represented about equally. Ss were administered 
the Semantic Differential in groups of three, as in the original 
study. 

The first replication was similar in all possible respects to the 
original. The second differed in that Ss rated all of the nouns first; 
then these ratings were taken from them and they rated all of the 
verbs. Under this condition, Ss could not consult their noun protocols 
prior to rating the verbs and, since there were 108 ratings each 
for nouns and verbs, it was assumed that these would not be recalled 
readily. It was hypothesized that the differences observed 
by Livant would be obtained in the first replication but not in the 
second. 


Results and Discussion 


Table 1 presents the findings of the original study as they were 
reported by Livant and a similar summary of findings for both 
conditions of the present study. Frequencies refer to comparisons 
between mean verb and mean noun ratings for each word across 
the 21 Ss. For each dimension there were 12 comparisons on three 
scales, resulting in an N of 36. 


TABLE 1 
Summary of Noun-Verb Comparisons for the Original Study and for the 
Two Conditions of the Present Study 

                                                        Dimension 
Study                      Comparison            Evaluation  Activity  Potency 

Original Study             Verb more positive      20(2)      30(13)    19(5) 
                           Verb equals noun         1          0         1 
                           Noun more positive      15(2)       6(0)     16(4) 

Replication 1              Verb more positive      16(5)      28(10)    28(9) 
(Demand characteristics    Verb equals noun         3          2         4 
included)                  Noun more positive      17(2)       6(0)      4(0) 

Replication 2              Verb more positive      18(3)      22(3)     17(2) 
(Demand characteristics    Verb equals noun         3          1         5 
excluded)                  Noun more positive      15(1)      13(0)     14(2) 

Note. Number of significant comparisons (p < .05) is in parentheses. 


Livant’s conclusions were based upon the observation that the 
percentage of words for which verb forms were rated more positive 
than noun forms was significantly higher than would be expected 
by chance for the activity dimension, but not for the evaluative 
dimension. These results were obtained in the first replication (for 
the activity dimension, z = 3.85, p < .001). In addition, this percentage 
was significantly better than chance for the potency dimension 
(z = 3.85, p < .001), whereas in the original study 
no differences were observed and predictions were for opposite effects. 

One conjecture is that higher potency ratings for verbs are a 
further representation of Ss’ formulations of E’s hypotheses in this 
paradigm and that Ss in the present study were more alert to such 
cues or motivated to act upon them. Ss for this study were recruited 
from a psychology course and most of them had participated 
in experiments previously. Either of these factors may have 
distinguished them from Livant’s “college student” sample and 
contributed to their greater responsiveness to demand characteristics. 

The reasons for the differences in potency ratings between sam- 
ples may be more complex than this, however, since Livant’s group 
did show greater frequencies than would be expected by chance of 
verbs rated significantly higher than nouns and of nouns rated 
significantly higher than verbs. It may be that, for some reason, 
the demand characteristics with regard to potency ratings were 
qualitatively different between groups rather than stronger in the 
present sample. 

Whatever the explanation for the differences in potency ratings, 
the contention that demand characteristics account for the findings 
for both the activity and potency dimensions is supported by the 
data of the second replication. Here, the percentage of cases where 
verb ratings are higher than noun ratings is not significantly 
greater than chance for either of these dimensions, nor do these 
percentages approach this level. Further, the difference in this percentage 
between conditions of the study with demand characteristics 
included and excluded reaches statistical significance for the 
potency scales (z = 2.58, p < .05) and approaches significance for 
the activity dimension (z = 1.55, p < .12). Similarly, the percentage 
of words for which mean verb ratings were higher to a 
statistically significant degree than mean noun ratings was significantly 
greater in the first replication than the second for the 
activity and the potency scales (z = 2.02, p < .05 and z = 2.13, 
p < .05, respectively). 
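
The between-condition comparisons in this paragraph can be illustrated with a two-proportion z test applied to the "verb more positive" frequencies in Table 1. The text does not state which formula was used, so what follows is only a sketch that assumes a pooled-variance normal approximation with 36 comparisons per condition. The function names `two_proportion_z` and `two_sided_p` are ours, and the values obtained are close to, but not identical with, the z values reported.

```python
import math


def two_proportion_z(x1, n1, x2, n2):
    """Pooled two-proportion z statistic (normal approximation).

    Assumption: the article does not give its exact formula, so this
    pooled form is only one plausible choice.
    """
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    return (p1 - p2) / se


def two_sided_p(z):
    """Two-sided p value for a standard normal deviate."""
    return math.erfc(abs(z) / math.sqrt(2))


# "Verb more positive" counts out of 36 comparisons, taken from Table 1:
# Replication 1 (demand characteristics included) versus
# Replication 2 (demand characteristics excluded).
z_potency = two_proportion_z(28, 36, 17, 36)
z_activity = two_proportion_z(28, 36, 22, 36)

print(f"potency:  z = {z_potency:.2f}, p = {two_sided_p(z_potency):.3f}")
print(f"activity: z = {z_activity:.2f}, p = {two_sided_p(z_activity):.3f}")
```

A pooled estimate is used here because the null hypothesis is that both conditions share one underlying proportion; an unpooled standard error or a continuity correction would shift the values slightly.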

One may contend that the procedure of obtaining noun and verb 
ratings separately in the second replication prevented the emer- 
gence of differences in connotative meaning. There is no indication 
in the relevant theories, however, that the different meanings 
evoked by different grammatical forms become manifest only 
when they are considered in direct contrast to each other. We con- 
clude instead that these differences, insofar as they have been 
demonstrated with the semantic differential, can most feasibly be 
interpreted in terms of the S’s response to demand characteris- 
tics. It is considered that the present data give further support to 
Orne’s description of the motives attached to the role of Subject, 
and indicate perhaps the vulnerability of the semantic differential 
to bias based on these motives when Ss are required to give comparative 
ratings of the sort obtained by Livant. 


Summary 


Livant’s (1963) study, demonstrating that there are systematic 
differences in semantic differential ratings of the same word as a 
noun or a verb, was replicated under the original conditions and 
under conditions whereby the subject could not refer to his noun 
ratings while making his verb ratings. The findings indicated that 
differences in ratings were a function of the subjects’ perceptions 
of the experimenter’s hypothesis, and gave further support to 
Orne’s (1962) notions regarding “demand characteristics.” 


A series of factor analyses of mood states, measured by means of 
adjective check lists and scales, has been reported since 1955. 
Green and Nowlis (1957) postulated four bipolar factors but 
emerged with eight factors for their check list of 140 adjectives. 
Utilizing 40 adjectives drawn from the longer Nowlis list, Borgatta 
(1961) isolated seven interpretable factors. Clyde assembled a 
mood scale which yielded six factors (1963). Lorr, McNair, Wein- 
stein, Michaux and Raskin (1961) developed an adjective mood 
scale on the basis of five postulated factors. Three factor analyses 
of the scale by McNair and Lorr (1964) yielded five replicated 
factors. An examination of the factors common and unique to the 
various analyses suggests that six factors have been replicated in 
two or more studies. The mood factors clearly confirmed can be 
labeled Depression, Vigor-Activity, Fatigue-Inertia, Tension-Anxiety, 
and Anger-Hostility. A possible sixth factor variously called 
Concentration, Thoughtful, Confusion, and Clear Thinking seems 
to be a composite of two factors. Green and Nowlis also identified 
a Social Affection factor confirmed by Borgatta. This particular 
grouping, defined by adjectives such as kindly, warm hearted and 
affectionate, will be ignored as probably defining a stable person- 
ality trait rather than a mood state. 

The five replicated factors appear to be defined most clearly in 
the McNair-Lorr analyses, which are based on responses obtained 
from neurotic psychiatric outpatients. Since such subjects are 
known to ascribe undesirable characteristics to themselves, addi- 
tional confirmation on a normal sample is needed. However, the 
main objectives of the present study were (a) to test whether the 
hypothesized mood states of Cheerful, Thoughtful, and Composed- 
Relaxed could be confirmed, and (b) to test whether the mood 
states conform to a circular ordering or circumplex (Guttman, 
1954). The mood factors in the hypothesized sequential order of 
the circumplex are as follows: cheerful, active-energetic, angry-hostile, 
tense-anxious, thoughtful, dejected, tired-inert, and composed-relaxed. 
The notion of a possible circular ordering arose 
while searching for missing mood states. Each mood tends to have 
a bipolar opposite and to represent a mixture of two other moods. 
For example, cheerful is related to composed-relaxed and active- 
energetic, and its opposite seems to be dejected. It thus seemed 
plausible to postulate a sequential ordering of moods. 


Method 


The mood scale consisted of 62 adjectives presented in a ran- 
domized sequence. Each adjective was rated on a four-point scale 
as “not at all,” “a little,” “quite a bit,” and “extremely.” The sub- 
ject was asked to rate himself on every word “the way he has felt 
during the past two days, including today.” Each hypothesized 
factor was defined by either seven or eight adjectives. 

The sample consisted of 166 male and 173 female undergraduate 
and graduate students from six universities. The mean age of 
the group was 23.9, while the range was from 18 to 36. 

The 62 variables were intercorrelated and factored by the mul- 
tiple group method (Guttman, 1952). Variables were grouped to- 
gether on the basis of the factors postulated. The residuals were 
sufficiently small to justify discontinuation of factoring. A few 
single plane rotations were then employed to achieve better simple 
structure. 


Results 
Mood Factors 


The variables correlating .25 or higher with each of the factors 
are given in Table 1. 


TABLE 1 
Correlations of the Adjectives with Eight Mood Factors 

Factor 
Adjective              A B C D E F G H 

Elated                 .70 
On top of the world    .69 
Excited                .60 
Light-hearted          .56 
Carefree               .56 
Gay                    .54 
Cheerful               .52 
Happy-go-lucky         .51 
Pretty good            .34 
Optimistic             .33 

Active                 .62 
Energetic              .56 
Full of pep            .27  .54 
Alert                  .53  .29 
Vigorous               .29  .51 
Lively                 .35  .50 
Enthusiastic           .42  .4[illegible] 

Furious                .68 
Annoyed                .67 
Angry                  .65 
Spiteful               .45 
Resentful              [illegible] 
Ready to fight         .44 
Bad-tempered           [illegible] 
Grouchy                .33 

Introspective          [illegible] 
Thoughtful             -.25  [illegible] 
Contemplative          [illegible] 
Pensive                [illegible] 
Earnest                .26  [illegible] 
Serious                .31  [illegible] 
Preoccupied            [illegible] 

Hopeless               .61 
Helpless               .59 
[illegible]            .57 
[illegible]            .36 
[illegible]            .82 
Blue                   .29 
Frightened             .28 
Apathetic              -.25  .26 

Weary                  .66 
Tired                  .66 
Sluggish               .43 
Lethargic              [illegible]  .38 
Lazy                   -.29  .38 
Listless               -.26  .35 
Languid                .28 

Calm                   .59 
At ease                .52 
Composed               [illegible] 
Relaxed                [illegible] 
Serene                 .34 
Nonchalant             [illegible] 


The first mood factor is called Cheerful as it measures an ele- 
vated mood. The factor most similar (Green-Nowlis) defines a 
Surgency (Playfulness) continuum. The only adjectives common 
to the two factors are carefree and lighthearted. Thus Cheerfulness 
is related to but not identical with Surgency. 

The second factor, labeled Energetic, has been identified in three 
prior studies by McNair and Lorr (1964) as Vigor-Activity. The 
third factor, Anger-Hostility, is also identical with one isolated 
by McNair-Lorr. The Green-Nowlis Aggression factor has only 
two adjectives in common with the present factor (annoyed, 
angry) and thus could not be the same. 

The fourth factor, called Tense-Anxious, has been identified 
three times by McNair and Lorr. The Green-Nowlis and the Bor- 
gatta studies isolated a factor entitled Anxiety which is definitely 
not the same. Their defining adjectives were clutched-up, peace- 
ful, startled, shocked and ashamed. The inclusion of a set of more 
suitable adjectives is probably the reason for the clearer definition 
in the present analysis. The fifth factor, called Depressed, is essentially 
the same as the depression factor in the Green-Nowlis, 
Borgatta, McNair-Lorr and Clyde studies. 

Inert-Fatigued, the sixth mood factor, is essentially the same as 
Green-Nowlis’ Deactivation, the Borgatta Tired factor, and the 
McNair-Lorr factor of the same name. Thus Inert-Fatigued is 
confirmed again as a major mood variable. The seventh mood 
factor, labeled Thoughtful, was newly formed for the present 
study. All adjectives assigned to the factor correlate most strongly 
with it. Green and Nowlis have a factor designated Concentration 
while McNair and Lorr label their factor Confusion. The Concen- 
tration and Confusion factors are defined by the following: able to 
concentrate, able to think clearly, efficient, confused, concentrat- 
ing, careful, attentive and serious. Thus Thoughtful resembles but 
is not identical with Concentration. 

The eighth factor appears to measure the hypothesized Relaxed- 
Composed mood. No corresponding factor has been isolated by 
Green and Nowlis or any other investigator. 

Thus, of the eight mood factors isolated, five confirmed factors identified 
in prior studies. In addition, the three new factors hypothesized, 
namely Cheerful, Thoughtful, and Composed-Relaxed, were also 
isolated. 


The Circular Order 


The circular rank order hypothesis is that qualitatively different 
traits in a given domain of behavior have a rank order among 
themselves without beginning or end. In a correlation matrix exhibiting 
a circular sequence, the highest correlations are next to the 
principal diagonal. Along any row (or column) the correlations 
decrease in size with distance from the diagonal and then increase 
again. In other words, the correlations of any particular variable 
decrease monotonically in size and then increase as a function 
of their sequential separation. 
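
The decrease-then-increase requirement just described can be checked mechanically. The sketch below is an illustration only: it uses a synthetic cosine matrix for eight variables rather than the correlations of this study, and the helper names `is_valley` and `circular_order_violations` are ours. For each variable it reads the correlations in order of sequential separation around the circle and tests whether they fall and then rise.

```python
import math


def is_valley(seq):
    """True if seq is non-increasing and then non-decreasing."""
    i, n = 0, len(seq)
    while i + 1 < n and seq[i + 1] <= seq[i]:
        i += 1
    while i + 1 < n and seq[i + 1] >= seq[i]:
        i += 1
    return i == n - 1


def circular_order_violations(r):
    """Indices of variables whose correlations do not fall and then rise.

    r is a square correlation matrix (list of lists) whose rows and
    columns are already arranged in the hypothesized circular order.
    """
    p = len(r)
    bad = []
    for i in range(p):
        # Correlations with the variables 1, 2, ..., p - 1 steps around the circle.
        around = [r[i][(i + k) % p] for k in range(1, p)]
        if not is_valley(around):
            bad.append(i)
    return bad


# Synthetic perfect circumplex (an assumption, not data from the study):
# eight variables equally spaced on a circle, r = cosine of the angle between them.
p = 8
ideal = [[math.cos(2 * math.pi * (i - j) / p) for j in range(p)] for i in range(p)]

# The same variables with the second and third positions interchanged.
order = [0, 2, 1, 3, 4, 5, 6, 7]
shuffled = [[ideal[a][b] for b in order] for a in order]

print(circular_order_violations(ideal))
print(circular_order_violations(shuffled))
```

An empirical matrix will rarely satisfy the strict criterion for every variable, so the list of violating variables is more informative than a single yes-or-no answer.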

Table 2 presents the correlations among the factors. Examination 
of the table reveals that the circular order requirements are 
at best only partially satisfied. While the correlations along the 
diagonal tend to be largest, the correlation between Inert-Fatigued 
and Composed-Relaxed is negative. Furthermore, the sequential 
increase and decrease is not monotonic for several variables. Thus, 
while the present data offer some support for the hypothesis, further 
trial will be necessary before a circular order can be established. 
It is evident that at least one mood variable is missing 
between Composed-Relaxed and Inert-Fatigued. Perhaps it is a 
Playful-Mischievous mood. It is also likely that an Excited factor 
can be shown to exist located between Energetic and Angry-Irritable. 
Further research is needed to test these hypotheses. 


TABLE 2 
Correlations among Mood Factors 

                        A      B      C      D            E      F      G      H 

Cheerful        A     1.00 
Energetic       B      .2    1.00 
Angry           C     -.00    .17   1.00 
Tense-Anxious   D     -.21    .16    .29   1.00 
Thoughtful      E     -.19    .04    .14    .50         1.00 
Depressed       F     -.34   -.15    .42    .49          .33   1.00 
Inert-Fatigued  G     -.29   -.23    .10   [illegible]   .38    .35   1.00 
Composed        H      .82   -.10   -.11   -.65         -.46   -.31   -.39   1.00 


Mood Factor Validity 


An important desideratum in any factor analysis is the collec- 
tion of evidence of construct validity. Towards this end the 
authors administered the mood scales to 70 psychology students 
at the University of Maryland under two conditions. First, all sub- 
jects were asked to describe their mood as of “right now” at the be- 
ginning of an ordinary class session. For the second condition, 
subjects were administered the mood scales just prior to the 
final examination. The presumption was that scores on Tense-Anxious 
and Angry would increase significantly while Cheerful and 
Composed would decrease significantly. 

Table 3 presents the results under the two conditions. Under the 
anxious condition students became less Cheerful, felt more fatigued, 
less composed, as well as more Excited, Angry, Tense and 
Dejected. Thus, in general, the hypothesized changes occurred in 
expected directions. 


TABLE 3 
Changes in Mood Level from Neutral to Anxious Condition 

                            Mean 
                      Anxious   Neutral        t             p 

Cheerful               12.22     14.02    [illegible]   [illegible] 
Energetic              13.35     13.38    [illegible] 
Angry-Irritable        16.09     10.75       4.29          <.001 
Tense-Anxious          20.74     11.12       6.11          <.001 
Thoughtful             11.89     12.17      -0.68 
Depressed              14.46      9.80       4.53          <.001 
Inert-Fatigued         14.65     13.38       1.46 
Composed               12.11     17.08    [illegible]   [illegible] 
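
Because the same students were measured under both conditions, a t statistic for paired observations is one natural way to obtain values like those in Table 3. The text does not say which form of the t test was used, so the following is only a sketch under that assumption; the function name `paired_t` is ours and the scores are invented for illustration, not taken from the study.

```python
import math
from statistics import mean, stdev


def paired_t(first, second):
    """t statistic and degrees of freedom for paired scores (second minus first)."""
    if len(first) != len(second):
        raise ValueError("paired samples must have equal length")
    diffs = [b - a for a, b in zip(first, second)]
    n = len(diffs)
    # Standard error of the mean difference uses the sample SD of the differences.
    t = mean(diffs) / (stdev(diffs) / math.sqrt(n))
    return t, n - 1


# Hypothetical Tense-Anxious scale scores for six students (not the article's data).
neutral = [10, 12, 9, 14, 11, 10]
anxious = [18, 21, 15, 22, 20, 19]

t, df = paired_t(neutral, anxious)
print(f"t({df}) = {t:.2f}")
```

With the 70 students of the study the degrees of freedom would be 69 under this paired form.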


Nowlis (1960) has reported a series of experiments designed to 
determine the effects of various mood-inducing “treatments.” Over 
400 college men were given his scale 12 times. Four of these conditions 
were selected emotionally provocative motion picture films. 
The adjectives defining his mood dimensions were found to vary 
concomitantly with the experimentally induced conditions. Lorr et 
al. (1961, 1964) also demonstrated that the factor scales are sensitive 
to drug effects in psychiatric outpatients. 

Zuckerman, Lubin, Vogel and Valerius (1964) developed a Multiple 
Affect Adjective Check List designed to measure Anxiety, Depression 
and Hostility. Empirical criteria rather than factor analyses 
were used to select items and the scales were validated in 
several types of experimental situations. A stressful film resulted 
in significant increases on the Anxiety and Depression scales in 
females. A “surprise examination” resulted in significant increases 
on all scales but the greatest increase in Hostility. Thus 
there is ample evidence that adjectives grouped into relatively 
unitary scales are sensitive to and reflect changes in mood induced 
by tranquilizers and by experimental and real life situations. The 
differences found in students during “Neutral” and “Examination” 
administrations clearly were in response to real life factors. 

Finally, we call attention to the possibility that mood states we 
often assume to be mutually exclusive may in fact co-exist. That 
is, subjects can claim to feel both states at the same time, albeit to 
varying degrees, as is implied by the correlation matrix in Table 
2. More work is needed to investigate the relationships among 
mood states, as well as conceptual schemata for their categorization. 


The Teacher Preference Schedules (TPS) are designed to assess 
"unconscious factors in career motivation among teachers" (Stern 
and Masling, 1958). The assessment of such motives, if they exist, 
would seem to be valuable in advancing our understanding of one 
class of factors related to the choice of teaching as a career, and 
also in increasing our understanding of variables actually predic- 
tive of teaching performance. The Schedules would seem then to 
have considerable potential value. 

Perhaps because of the recency of the Schedules, little evidence 
on their validity has been accumulated although a few available 
studies show them to be highly promising (Jones and Gottfried, 
1963a; Jones and Gottfried, 1963b; Jones and Nelson, 1963; Masling 
and Stern, 1963; Wallen, Travers, Reid, and Wodtke, 1963). 
The present study provides additional data on TPS validity. Re- 
liability was treated incidentally. The research was stimulated 
by the test items which seemed to the present writers to be po- 
tentially sensitive to social desirability factors. 

It is perhaps appropriate to attempt to simulate the framework 
from which the test authors worked. Two facets of their reasoning 
might have proceeded as follows: (1) it is possible to develop a 
test reflecting unconscious motives fulfilled by teaching; (2) these 
motives are unconscious in the sense that we assume that the sub- 
ject is unaware of certain forces and preferences which shape and 
influence his actions in, and attitudes toward, certain specified 
situations. Point two is consistent with the conventional defini- 
tion of unconscious motives. For our purposes this definition need 
not be questioned. However, since in the TPS unconscious motives 
are inferred from conscious processes, it is appropriate, and indeed 
necessary, to demonstrate that any inferences about motives are 
independent of the conscious processes from which these inferences 
were drawn: the subject should in no way, through his manner 
of responding, be able to influence inferences about his unconsci- 
ous motives. The crux of the present research efforts revolved 
around attempts to determine the degree to which individual re- 
spondents could, under experimentally induced response sets, ma- 
nipulate their responses to distort, in directions specified by the 
experimenters, the image of their “unconscious” motives as re- 
vealed by the TPS. Such concerns fit the construct validity rubric. 

Each TPS form comprises 10 subscales. The notion that different 
kinds of unconscious motives are reflected by these subscales 
is, of course, a central tenet of the test authors. The validity of the 
construct unconscious motives as underlying all the subtests can be 
subjected to empirical test. The construct validity of each subscale 
might be investigated also. This can be done by merely specifying 
kinds of behavior or responses which we might expect under cer- 
tain kinds of conditions. For present purposes, however, valida- 
tion of the construct presumed to underlie all subscales would 
seem to precede construct validation efforts with individual ones. 
If this first level of analysis meets with failure, there would be no 
need to pursue the validity of individual subscales; or if only 
certain subtests seem to hold up, analyses of these scales could be 
pursued. All of this assumes, of course, that a variety of relation- 
ships in the nomological net would be explored. In view of the 
variety of relationships that need to be explored, and with due 
concern for the many kinds of data that need to be gathered in 
construct validity studies, modest investigations such as the one 
reported here should perhaps be described as partial construct vali- 
dation studies. 

As has been noted elsewhere (Cronbach and Meehl, 1955), there 
will be, frequently, little information to provide the theoretical 
scaffolding from which empirically testable hypotheses may be de- 
duced. This is perhaps most likely to occur with tests such as the 
present one and indeed most others, which have been developed 
independently of a systematic theoretical framework. Nevertheless, 
a beginning must be made. Frequently beginnings are based on 
deductions from such evidence as is suggested by the test rationale. 
In the present study information from four kinds of analyses has 
been marshalled to bear on construct validity: 

1. Studies in which the subject, under experimentally induced 
sets, attempts to present a picture of laudable or nonlaudable mo- 
tives. Subtest scores obtained under set conditions are compared 
with scores obtained following regular TPS instructions. 

2. Inferences from correlational data. It was expected that cor- 
relations between scores obtained under the induced conditions, 
and those obtained following regular instructions would be small 
or negligible. 

3. Post hoc analyses of TPS scores of students comprising vari- 
ous curriculum groups. 

4. Analyses of studies which tested certain ad hoc hypotheses 
using the TPS. 


Method 


Subjects 


The Ss were 385 Miami University students enrolled in grad- 
uate and undergraduate courses in education: 234 students en- 
rolled in introductory educational psychology, and 151 graduate 
students enrolled in a course in educational tests and measurements. 
Undergraduate subjects were female, while graduate subjects 
were divided, about equally, between the sexes. 


Teacher Preference Schedules (TPS) 


The TPS comprise two forms, G and A. Form G was developed 
to assess certain unconscious gratifications fulfilled by teaching, 
while Form A was developed to reflect attitudes in the same areas 
covered by Form G. Each form comprises 100 items, 10 items for 
each of 10 subtests. The subject responds to each item using six 
Likert-type categories which range from strong agreement through 
strong disagreement. The test authors’ description of subtests comprising 
the Schedules, and representative Gratification (G) and Attitude 
(A) items are presented below (Stern and Masling, 1958; 
Stern, Masling, Denton, Henderson, and Levine, 1960): 


1. Practical. These people utilize teaching as a means of achiev- 
ing pragmatic, utilitarian, tangible goals. Involvement in teaching 
is largely limited to the instrumental value the occupation has for 
such individuals in terms of hours, pay, vacations, and similar 
sources of gratification. Since their primary investments are in 
nonacademic activities, the supporting attitudes must necessarily 
justify detachment. 

(G) “Finishing all my work during the school day, so that when 
I go home my time will be my own.” 

(A) “It wastes a lot of the teacher’s valuable time when he has 
to deal with problem children himself, instead of being able to 
refer them immediately to the principal, guidance officer, or 
school psychologist.” 


2. Status-Striving. These are the individuals, often from lower-class 
backgrounds, who see the teacher as having a high status, and 
for whom this perceived status is more important as a source of 
gratification than the teaching function itself. The significant atti- 
tudes in this case reflect preoccupation with professional dignity 
and propriety. 

(G) “Being selected to represent the teaching profession on a 
civic committee.” 

(A) “Teachers are among the cultural and educational elite of 
their community.” 


3. Nurturant. These teachers are characterized by a pervasive 
feeling of affection for children, and a desire to assist and support 
them. These teachers are warm and loving in their relationships 
with children, devote themselves freely to their pupils’ problems, 
and derive their greatest satisfactions from the reciprocal affection 
and gratitude of the children. They justify these activities on the 
grounds that a child’s greatest need is for love. 

(G) “Having pupils confide in one as a parent.” 

(A) “A pupil’s just need is for warmth and tenderness.” 


4. Nondirective. The motive here is to minimize the pupils’ expression 
of dependency on the teacher. These teachers feel rewarded 
to the extent that their pupils demonstrate capacities for self-direction, 
and they identify with an ideology which stresses respect 
for the integrity of the child and justifies the use of pupil-centered 
classroom techniques in the name of self-actualization. 
(G) "Inviting pupils to question my decisions and express their 
own opinions." 
(A) "Children should never be embarrassed or made to feel in- 
ferior by a teacher." 


5. Critical. For these teachers the central theme is a dedication 
to reform and improvement. These teachers are the organizers and 
critics of the profession, and they find gratification in the opportunities 
which exist for championing an unpopular cause. Relevant 
attitudes involve criticism of contemporary practices in educa- 
tional administration and a generally negative view concerning the 
qualifications and motives of authority figures. 

(G) “Fighting for better pay, sickness and accident prevention,
retirement provisions, etc., for teachers.”

(A) “Many of the most important decisions about schools are
made by people who know nothing about education.”


6. Pre-Adult Fixated. These people, feeling essentially inade- 
quate in the role of an adult, prefer the society of children to that 
of their own agemates. Their greatest pleasures in teaching come 
from sanctioned opportunities to participate vicariously, and some- 
times directly, in the activities of their pupils. Their attitudes re- 
flect an idealization of childhood and a justification for identify- 
ing with pupils. 

(G) “Being invited by pupils to join in their games and parties.” 

(A) “Communication between the teacher and his pupils is facilitated
if he can get them to accept him as a “pal,” sort of as
one of them.”


7. Orderly. The motive here is to codify and regulate behavior, 
minimizing the uncertainties inherent in personal interactions. 
These teachers are characterized by a compulsive preoccupation 
with rules and procedures, and they are most gratified by demon- 
strations of bureaucratic timing and organization in the classroom 
and school. They justify this in terms of the need for developing 
good pupil habits. 

(G) “Having pupils do over papers that are not neat.” 

(A) "Every assignment must specify what the pupils are to do
and how they are to do it."


8. Dependent. The focus for these teachers is the inverse of the 
critic. Their personal insecurities are expressed in the form of a 
reliance on support from authority figures, and their major gratifi- 
cations in teaching come from close supervision and guidance. 
Supporting attitudes justify compliance and cooperation with au- 
thority on the grounds that superiors know best. 

(G) “Having a principal who takes a close interest in things I
do.”

(A) “A teacher can seldom go wrong in following his principal’s
or supervisor’s advice.”


9. Exhibitionistic. For this group of teachers the motive is oriented
toward personal display and attention-seeking. These teachers
achieve satisfaction from opportunities to entertain and captivate
their pupils. They have a pervasive need to be admired, and
they rationalize their exhibitionistic activities in the classroom on
the grounds that clowning, "personality," and showmanship are
essential qualities for effective instruction.

(G) "Being appreciated by the children for my sense of humor." 

(A) "A little clowning is a good way to hold the student's attention
and make the learning process more pleasant."


10. Dominant. These individuals are concerned with reassur- 
ances regarding their own superiority and value. The subordinate 
status of the pupil is a significant source of gratification for them,
and they derive considerable pleasure from activities which keep 
the child in that position to the enhancement of their own. These 
behaviors are justified in terms of the need to maintain discipline. 

(G) “Running my class with a firm hand.” 


(A) “There are fewer disciplinary problems when pupils are 
somewhat fearful of the teacher.” 


Procedure 


Graduate subjects completed Form G under three conditions: 
(1) following standard instructions on both occasions (reliabil- 
ity); (2) using the standard instructions on the first testing occa- 
sion followed by nonstandard ones on the second (S manipulates 
the items to present the most favorable picture possible) (the order
in which the Ss received the tests was counterbalanced for a second
group of Ss); and (3) using the standard TPS instructions,
followed by a set to present the most unfavorable picture possible
(also counterbalanced for order). An interval of approximately 
two weeks existed between tests. Five different groups of Ss were 
used in meeting the experimental requirements just outlined. Spe- 
cific instructions introducing the three sets were as follows: 


Regular Instructions. The purpose of this schedule is to investi- 
gate teachers’ preferences for various aspects of teaching. The 
schedule consists of a number of statements describing many kinds 
of activities, events and situations relating to teaching. Teachers 
differ in their feelings about these activities, and this schedule has 
been developed as an aid to determining how great and how varied 
these differences are. It is important that you record your own per- 
sonal feelings about these activities, even in those cases where you 
think most teachers feel differently than yourself. Your responses 
will be processed and tabulated with those of other teachers by 
means of electronic devices, and no one will ever be permitted to 
examine the replies you have given here. Although your name is to 
be recorded on the answer sheet, this is solely for the purpose of 
cross-tabulating materials from the other schedules you have com- 
pleted. 

Please indicate on the special answer sheet the items you like, 
approve of, or would find pleasant to experience and, conversely, 
those that you dislike, disapprove of or would find unpleasant to 
experience. For purposes of this study it is not important whether 
or not you have actually done the things mentioned or have really 
had the opportunity to experience the events described. The sched- 
ule requires only an indication of your feeling about these events if 
you were to have the opportunity to experience them. (Specific di- 
rections for the completion of individual items follow.) 


Favorable Set. The purpose of this schedule is to investigate 
teachers’ preference for various aspects of teaching. The schedule 
consists of a number of statements describing many kinds of ac- 
tivities, events, and situations relating to teaching. Teachers differ 
in their feelings about these activities, and this schedule has been 
developed as an aid to determining how great and how varied these 
differences are. Imagine that you are trying to create the impression
with a new principal that you are the best teacher in your
field. Although your own feelings may differ, please respond to
these questions as you would respond if trying to create the impression
that you were the best teacher in your field. Responses will
be processed and tabulated with those of other teachers by means
of electronic devices, and no one will ever be permitted to examine
the replies you have given here.

Please indicate on the special answer sheet the items that you 
would like, approve of, or would find pleasant to experience, and 
conversely those that you would dislike, disapprove of, or would 
find unpleasant to experience if trying to create the impression that 
you were the best teacher. For purposes of this study it is not im- 
portant whether or not you have actually done the things men- 
tioned or have really had the opportunity to experience the events 
described. (This section is followed by the typical directions for 
test completion as given by the test authors.) 


Unfavorable Set. The purpose of this schedule . . . how varied
these differences are. Imagine that you are trying to create the im- 
pression with a new principal that you are the worst teacher in 
your field. Although your own feelings may differ, please respond to 
these questions as you would respond if trying to create the im- 
pression that you were the worst teacher. Responses will be proc- 
essed and tabulated with those of other teachers by means of elec- 
tronic devices, and no one will ever be permitted to examine the 
replies you have given here. 

Please indicate on the special answer sheet the items you like, 
approve of, or would find pleasant to experience, and conversely, 
those that you would dislike, disapprove of, or would find un- 
pleasant to experience if trying to create the impression that you 
were the worst teacher. For the purposes of this study . . . the
events described.

Differences in subtest scores as a function of set, of order of test 
administration, and of the interaction between set and order were 
analyzed using a Lindquist Type II analysis of variance design 
(Lindquist, 1953). Also, Pearson Product Moment correlation co- 
efficients between scores obtained under the various set conditions 
were computed. 
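
The product-moment coefficient referred to above can be sketched as follows. This is a minimal illustration only: the function name `pearson_r` and the two short score lists are hypothetical and are not data from the study.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length score lists."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two equal-length lists with at least two scores")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of cross-products and sums of squares about the means
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# Hypothetical subtest scores for five Ss (illustrative only, not study data)
regular_set = [12, 15, 9, 20, 17]
favorable_set = [14, 16, 10, 19, 18]
r = pearson_r(regular_set, favorable_set)
```

In the design described here, one such coefficient would be computed per subtest for each pairing of set conditions.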

The undergraduate Ss were used to test a secondary hypothesis:
that differences in mean subtest scores existed as a function of
the students’ grade or subject-area preferences. To provide data
for this analysis each S completed TPS forms A and G, and pro- 
vided in addition certain information on their grade and subject 
matter teaching preferences. These latter data permitted categori- 
zation so that simple analyses of variance could be carried out for 
each of the ten subtests comprising each scale. 


Results and Discussion 


Effects of Experimental Sets. Statistical data pertinent to the ef- 
fects of instructions (regular vs. experimental), order of testing 
(i.e., the counterbalanced presentation of regular or experimental
instructions), and the interaction between instruction and order 
are summarized in Table 1. Data in this Table reveal that only on 
two of ten subtests (Status Striving, and Dependency) could the 
subject significantly change his subtest scores to reveal a pattern 
of motives different from that revealed by his scores obtained fol- 
lowing standard TPS instructions. Although the subjects were able 
to effect statistically significant changes on these subscales we can 
only assert that the subtests are manipulable under the specific 
response sets induced. There is some suspicion that the set instruc- 
tions, e.g. “... you are trying to create the impression with a new 
principal that you are the best teacher in your field” have given 
certain responses a focus which they may not have had under 
other sets. Motives reflected by the two subscales in question are 
those reflecting preoccupation with professional dignity and pro- 
priety (Status Striving), and “compliance and cooperation with 
authority on the grounds that superiors know best” (Dependency). 
The content of these particular subscales, unlike that of most 
others, is probably highly correlated with the response set in- 
structions. There were two other subtests (Practical, and Critical) 
which, on a priori grounds, one would also expect to have been af- 
fected by the set instructions. However, scores on these subtests 
merely approached the predetermined .05 significance level (p <
.10). Although not unequivocal then, these findings indicate that
the respondents are capable, under certain specifically induced sets,
on some subtests, of presenting a picture of motives which may be
interpreted as laudable. 

The face validity of a request of an individual to present a picture
of nonlaudable motives might be questioned. Seldom, if ever,
does one intentionally put his “worst foot forward.” Such concern 
is not pertinent to the present investigation: the intent is to de- 
termine whether or not the respondent can "divine" the nature of 
the individual items. It would be assumed of course that the re- 
spondent knows the desirable response to make after having “di- 
vined” the items. In a related vein we assume that high agree- 
ment exists among teachers concerning desirable responses. If this 
is not the case, one would expect scores under the various experi- 
mentally induced sets to be random among individuals. Hence, 
they would not produce consistent test score differences when re- 
quested to orient their scores in a specified direction. Quite natu- 
rally, one would expect this to be an assumption of the test au- 
thors. The respondent may well be able to change the picture of 
his motives in the direction specified by E without being aware of
the nature of the motive which he was manipulating. The Ss' ability to
identify motives presumably measured by TPS subscales was not
investigated.

Instructions requiring the respondents to present a nonlaudable
picture of their motives were seen as complementing the analyses in
which the respondents presented a picture of laudable ones. The
data in Table 1 reveal that the respondents were able to effect
statistically significant score changes (p < .05) on nine of 10
subtests.

No significant interaction between order of scale administration 
and instruction occurred for the set condition involving laudable 
motives. Such was not the case for the groups simulating non- 
laudable motives. On three subtests, Nurturance, Rebellious and 
Pre-Adult Fixated, interactions significant at the .01 or .001 levels 
occurred. These interactions demonstrated that for these subtests, 
order of administration was not independent of test performance. 


Correlational Analyses. Additional evidence on scale validity has 
been obtained by correlating the subject’s scores under the two 
conditions of test administration (i.e., regular and experimental).
If in fact the subtests are not manipulable, then correlations be- 
tween scores under the conditions of induced set and scores ob- 
tained under normal conditions of TPS administration should be 
small or negligible. This was not the case (see Table 2). In 24 of
40 instances there were statistically significant correlations between
scores. Most frequently the correlations were of larger magnitude
(and positive) for groups completing the subtests under a
set to give a laudable picture of their motives. One plausible
interpretation of this finding is that even upon taking the test following
regular instructions the respondents were presenting a picture
of laudable motives. Such an interpretation would account for the
subjects' failure, in most instances, to produce scores significantly
higher than those obtained following the regular instructions when
requested to do so. A second interpretation would be, of course, that
the respondents could not manipulate the tests to present a distorted
picture of their motives.


TABLE 2
Correlations between Subtests As a Function of Response Set

                                       Set
                R1 - R2¹    R1 - H2²    H1 - R2     R1 - L2³    L1 - R2
Subtest         N = 43      N = 31      N = 31      N = 28      N = 18

Practical       .725**      .314        .470**      -.281        .038
S. Striving     .857**      .284        .572**      -.393*      -.324
Nurturance      .806**      .634**      .811**       .026       -.025
N. Directive    .739**      .456**      .600**      -.429*      -.526*
Rebellious      .855**      .659**      .500**      -.148       -.221
P. Adult Fix.   .768**      .530**      .740**      -.071       -.654**
Order.          (illegible) .504**      .740**      -.237       -.707**
Depen.          .810**      .288        .639**      -.441*      -.380
Exhib.          .835**      .700**      .508**       .067        .037
Dominance       .823**      .424*       .740**      -.469*      -.315

* p < .05.
** p < .01.
¹ R1 - R2 = test-retest method (two week interval).
² R1 - H2 = first administration under regular instructions; second
administration under instructions to obtain a high score.
³ R1 - L2 = first administration under regular instructions; second
administration under instructions to obtain a low score.

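
The significance markers attached to the coefficients in Table 2 can be checked, at least approximately, with the usual t transformation of r, t = r√(N − 2)/√(1 − r²) on N − 2 degrees of freedom. The sketch below assumes two-tailed tests, which the text does not state; the critical values are taken from standard t tables for df = 29, and the function name `t_from_r` is hypothetical.

```python
from math import sqrt


def t_from_r(r, n):
    """t statistic (df = n - 2) for testing a Pearson r against zero."""
    return r * sqrt(n - 2) / sqrt(1.0 - r * r)


# Two-tailed critical values of t for df = 29, from standard t tables.
# The two-tailed choice is an assumption; the source does not specify it.
T_CRIT_05 = 2.045
T_CRIT_01 = 2.756

# Two R1 - H2 coefficients from Table 2 (N = 31)
t_dominance = t_from_r(0.424, 31)     # marked * (p < .05) in Table 2
t_nondirective = t_from_r(0.456, 31)  # marked ** (p < .01) in Table 2
```

Under these assumptions the Dominance value falls between the two critical values and the Non-Directive value just exceeds the .01 value, which agrees with the markers printed in the table.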

Reliability. No reliability coefficients have been reported in 
TPS materials. Therefore, in the present study test-retest data 
over a two-week interval were obtained, using Form G of the 
Schedules. These data are reported in Table 2. Reliability coefficients
for individual subtests were distributed between a low of
.725 and a high of .857. Coefficients of this magnitude are judged
to be quite high for 10-item subscales reflecting noncognitive
attributes.


Differences in Motives among Students by Curriculum Preferences.
It should be possible, on theoretical grounds, to predict
the ordering of motives among teachers (or prospective teachers)
in certain curriculum groups. One might suspect that some
kinds of motives are much more likely to be dominant among persons
working in (or preparing for) some teaching specialties than
among persons working in (or preparing for) others. The present
researchers did not believe themselves to possess the wisdom needed
to order the six curriculum groups (i.e., kindergarten,
elementary 1-3, elementary 4-6, high school social studies-English,
high school home economics-industrial arts, and high school
mathematics-physics) on each of the ten TPS subtests. Therefore, simple
analyses of variance were done across curriculum groups by subtest
and TPS form (i.e., A or G) without prior theoretical
commitments.

The results of simple analysis of variance, by curriculum groups,
are presented in Table 3. For both Forms A and G, four subtests
(Nurturance, Non-Directive, Pre-Adult Fixated, and Exhibitionistic)
were found to differentiate curriculum groups. The ordering of
scores on these subtests is in the direction which post hoc reasoning
would suggest: higher mean scores among students in
kindergarten-elementary curriculums; lower scores for those in secondary
ones.


TABLE 3
Grade or Subject Teaching Preferences

                    Form G (N = 234)               Form A (N = 234)
                    Mean Squares                   Mean Squares
Subtest          Between    Within     F        Between    Within     F

Practical          84.34     45.84    1.84        82.22     43.09    1.90
S. Striving       106.90     71.92    1.49        33.62     49.69    0.68
Nurturance        176.76     54.52    3.24**     222.61     64.62    3.46
N. Directive      112.76     50.16    2.26*      169.80     58.18    2.92
Rebellious         59.72     84.05    0.71        63.89     57.74    1.10
P. Ad. Fix.       562.90     67.38    8.35**     210.60     73.28    2.87
Order.             19.19     40.99    0.47        27.45     50.89    0.54
Depen.             38.36     52.78    0.73       107.84     58.42    1.85
Exhib.            129.90     47.36    2.74*      133.46     52.71    2.53
Dominance          55.97     48.80    1.15        44.83     53.67    0.83

* p < .05.
** p < .01.
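
Each F in Table 3 is the ratio of the between-groups mean square to the within-groups mean square. The short sketch below recomputes three of the Form G values from the tabled mean squares; the names `FORM_G_MEAN_SQUARES` and `f_ratio` are hypothetical. If all 234 Ss fell into one of the six curriculum groups, the ratios would be referred to the F distribution on 5 and 228 degrees of freedom, although the text does not report the degrees of freedom.

```python
# Mean squares (between, within) for Form G, transcribed from Table 3
FORM_G_MEAN_SQUARES = {
    "Practical": (84.34, 45.84),
    "Nurturance": (176.76, 54.52),
    "P. Ad. Fix.": (562.90, 67.38),
}


def f_ratio(ms_between, ms_within):
    """F ratio for a simple (one-way) analysis of variance."""
    return ms_between / ms_within


# Round to two decimals to match the precision printed in the table
f_values = {
    name: round(f_ratio(between, within), 2)
    for name, (between, within) in FORM_G_MEAN_SQUARES.items()
}
```

The recomputed ratios agree with the tabled F values for these three rows, which supports the reading of the table columns given above.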

Other Empirical Research Using the Schedules. Empirical studies
by several researchers present additional data bearing on TPS validity
(Jones and Gottfried, 1963a; Jones and Gottfried, 1963b;
Jones and Nelson, 1963; Wallen, Travers, Reid, and Wodtke,
1963). The first three studies just cited tested, respectively,
assumptions about (1) differences in motives between satisfied and
dissatisfied teachers of educable mentally retarded children; (2)
differences in motives among persons interested in various special
education programs; and (3) differences in motives between mature
college graduates preparing for careers in public school teaching
and those qualified but not enrolling for such preparation. In
all instances statistically significant differences were found among
groups’ responses on individual subtests, and these differences
were ones either predicted beforehand, or ones amenable to certain
ad hoc explanations. Whether or not the attributes measured reflect
unconscious motives, however, was not addressed by these
studies.

The three investigations just cited only illustrate the fact that 
certain TPS subtests differentiate among subjects comprising cer- 
tain groups. A more convincing demonstration of Schedule validity 
resides in studies which show that persons scoring in a given direction
on specific subtests engage in certain operationally specifiable
behaviors differently than persons scoring in some other direction.
Such a demonstration using the TPS in modified form has
been given recently by Wallen, Travers, Reid, and Wodtke (1963).
These researchers showed that statistically significant relationships
existed between actual classroom behaviors and teacher responses
to the modified Schedule.


Summary 


The construct validity of the Teacher Preference Schedules, designed
to reflect unconscious motives fulfilled by teaching, was
investigated. The rationale of the study was that if the test reflected
unconscious motives, the subjects would be unable to manipulate
the test items in a consistent and predictable manner to change the
picture of their motives revealed by scores achieved following the
standard instructions. Several other analyses thought to be germane
to construct validity were done: (1) analyses of correlations
between scores obtained under regular and induced sets; (2) analyses
of differences among subtest scores of subjects by curriculum
groups; and (3) analyses of certain empirical studies which used
the TPS. Form A of the Schedules, designed to measure verbalized
motives in the same areas covered by Form G, was investigated
incidentally. Also, the reliability of Form G was studied.

In the first part of the study each subject completed TPS Form 
G under three conditions: (1) following standard instructions on 
both occasions (reliability); (2) using the standard instruction on 
the first testing occasion followed by the nonstandard ones (pre- 
sent a picture of laudable motives) on the second occasion; and (3) 
using the standard TPS instructions, followed by a set to present 
the most unfavorable picture possible. Test administrations were 
counter-balanced for order where appropriate. Analysis of the 
data using a Lindquist Type II analysis of variance design revealed
that the Ss could consistently manipulate the items to present
a picture of nonlaudable motives. However, they could not do
so with the same consistency in presenting a picture of laudable
ones.

Correlational analyses revealed high and statistically significant 
relationships between subtest scores obtained following the set to 
present a picture of laudable motives and those obtained follow- 
ing standard instructions. This was less the case when the sub- 
jects were requested to present a picture of nonlaudable motives. 

Test-retest reliability coefficients over a two-week period ranged 
from .725 to .857. These were judged to be quite high.

The results of simple analysis of variance of subtest scores of 
both Forms A and G revealed four subtests that differentiated 
among curriculum groups: Nurturance, Non-Directive, Pre-Adult 
Fixated, and Exhibitionistic. The ordering of mean scores by cur- 
riculum group (i.e., elementary, 1-3, 4-6, kindergarten, etc.) on 
these subtests was in the direction which ad hoc reasoning would 
suggest: higher mean scores among students in kindergarten-elementary
curriculums; lower scores for those in secondary ones.

Analysis of a small number of empirical studies using the TPS 
revealed the Schedule capable of providing some support for
theoretically based hypotheses; or at least of providing support for
certain ad hoc interpretations. 

Over-all analyses of the several kinds of data have been in- 
terpreted as suggestive of scales capable of reflecting attitudes or 
gratifications which seem to be pertinent to the choice of teaching
as a career, or to actual teaching performance. In view of the fact 
that when requested to do so the respondents were able to manipu- 
late the test items (Form G) to consistently present a picture of 
nonlaudable motives, and while they were also able to simulate a 
picture of laudable motives (to a more limited degree), the validity 
of the notion that a construct “unconscious motive” underlies the 
schedules seems equivocal. 

For most measurement purposes, printed paper-pencil tests administered
to groups using some pre-established limits on total testing
time have traditionally been used. Under these conditions, no
control of individual item exposure time is possible; Ss may pro- 
ceed through the collection of items at different rates using dif- 
ferent strategies for time allocation. In speeded tests, all items 
may not be attempted by all Ss. Such time-dependent variables 
pose well-known problems in item analysis and test reliability 
estimation and may also affect test validity adversely (Davidson 
and Carroll, 1945; Wesman, 1949, 1960; Thorndike, 1950; Guil- 
ford, 1954; Cronbach, 1960). Largely because of these problems, 
the general trend in both aptitude and achievement measurement 
has been toward the construction of relatively unspeeded tests. 

Allowing sufficient time for all Ss to finish solves the problem 
of uncompleted items, though it does not control differences in 
time-allocation strategies. There are conditions, furthermore, under 
which a speeded test may have an advantage in predictive validity 
(Lord, 1953) and it has even been argued that such tests provide 
an essential, perhaps the best, measure of mental ability (Eysenck, 
1953). Also, there are areas of research in which an ample but 
constant item exposure interval would be highly desirable. In some 
reliability studies and in forms of instructional research where
treatment repetitions or immediate and delayed recall criteria are
required, two or more administrations of the same test may be involved.
The need for uniformity across administrations would seem
to dictate the control of item time (and sequence) as well as test
time. In factor-analytic investigations of intellectual abilities,
particularly in the memory domain, there is reason to believe that
increasing temporal-sequential precision in item presentation
might add uniquely in obtaining operational, as well as dimensional,
interpretations of mental functioning (Seibert and Snow,
1965). The use of rapid-paced exposure control has also been recommended
to insure “first impression” responses (King, et al.,
1961) and to induce and measure stress performance (Gibson,
1947).

Thus, by fixing item exposure time, the problems of uncom- 
pleted tests and uncontrolled time-allocations seem soluble while, 
at the same time, progress in several areas of research which rely on 
test performance variables might be augmented. With such control, 
each item becomes a separately timed part-test in which any de- 
sired degree of speeding may be incorporated while assuring its 
attempt by, or at least its exposure to, all Ss. One readily available 
means to this end exists in the use of standard slide-projection 
equipment, with items appearing on separate slides. Until recently,
however, serious consideration has not been given to this
idea.

A series of investigations begun by Curtis and Kropp (1961a,
1961b) and continued by Burr (1963) stemmed from the practical
possibilities of instructional television, where the televising of test
materials forces control of item exposure and sequence. Various
comparisons among printed, visual slide projected, and audio-visual
projected media over a wide variety of tests and inventories,
including the study of variable exposure intervals based on item
difficulty, supported the general conclusion that no substantial
differences should be expected among these presentation media in
terms of item and test characteristics. These studies sought to
equate unspeeded projected and paper-pencil modes, however. They
did not attempt to push alternative forms of test administration
to their respective speed limits, nor did they investigate the
possibility that relevant individual differences among Ss, such as visual
acuity, reading speed, or verbal comprehension variables, might
be differentially related to performance under different testing
conditions. The purpose of the research reported here, therefore, was
to compare printed and slide-projected test administrations, under
both relatively speeded and relatively unspeeded conditions, in
terms of item and test statistics as well as correlations with selected
characteristics of Ss.


Procedure 


Two independent studies were conducted, the first planned with 
relatively small scope to serve mainly as a preliminary to the 
more extensive second study. While the two differed in minor de- 
tail, they may conveniently be combined for presentation. 

The investigation was carried out within the context of a larger 
project on informational learning from instructional films.¹ A 38-item
multiple-choice achievement test, specially constructed to
cover the content of a particular instructional film,² was obtained
from the parent project. In both Study I and Study II, all 
treatment conditions involved this test, administered immediately 
before and after the film showing. The conditions thus included 
identical instructional stimuli and differed only in the form of 
pre- and posttest presentation used. 


Subjects and Treatments. 


For Study I, an experimental condition was designed in which 
each of the 38 items was slide-projected with a 20 sec. exposure
interval. The control condition presented the 38 items in standard 
printed page format with a total test time limit of 12.67 min. (38 
items X 20 sec.). A total sample of 64 paid volunteers, consisting 
of Purdue University underclassmen, was randomly divided to 
form the two treatment groups. Since some Ss failed to attend their 
group sessions, the remaining 58 Ss (29 males, 29 females) were 
unevenly distributed in the experimental (N = 27) and control 
(N = 31) groups. Both groups met for work in the same room on
consecutive weekday evenings. Ss were assigned to seats randomly. 


¹ The parent study was supported by a grant from the Department of
Health, Education, and Welfare, Office of Education, under the provisions of
Title VII of the National Defense Education Act of 1958.
² Steam Turbine (Allis Chalmers) is a 25-minute color film which describes
the theory and application of modern steam turbines.

For Study II, two experimental slide-projected conditions were
defined, using 12 sec. and 18 sec. exposure intervals. The com- 
parable control treatments used 7.60 min. and 11.40 min. total 
time limits, respectively. A total sample of 206 paid volunteers, 
consisting of Purdue undergraduates, was randomly divided to
form the four groups. Of these, 195 Ss completed the experiment 
(128 males, 67 females), providing 51 Ss for the printed 18 sec. 
group and 48 Ss for each of the other three groups. The groups met 
for work in the same room on consecutive weekday evenings, two 
groups per night, again with random seating assignment. 
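
The total time limits given to the printed control groups follow directly from the per-item exposure intervals: the number of items multiplied by the exposure interval, converted to minutes. A minimal sketch of that arithmetic is given below; the names `N_ITEMS` and `total_time_limit_min` are hypothetical.

```python
N_ITEMS = 38  # items in the achievement test, as stated in the text


def total_time_limit_min(exposure_sec, n_items=N_ITEMS):
    """Total printed-test time limit (minutes) matching a per-item exposure."""
    return n_items * exposure_sec / 60.0


# Exposure intervals used in Study II (12, 18 sec.) and Study I (20 sec.)
limits = {sec: round(total_time_limit_min(sec), 2) for sec in (12, 18, 20)}
```

The values obtained (7.60, 11.40, and 12.67 min.) match the limits reported for the three control conditions.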

Prior to their assignment to groups, all Ss in Study II were tested 
for near-point and far-point visual acuity using the Orthorater. At 
the beginning of each session, the Tinker Speed of Reading Test 
(Tinker, 1955) and the Wide Range Vocabulary Test (French, 
Ekstrom, and Price, 1963), a measure of verbal comprehension, 
were administered. Four Likert-type attitude items, forming a scale 
for attitude toward the testing methods used, were administered at 
the end of each session. In addition, viewing distance in feet from 
each seat to the screen was assigned as a personal variable to the 
S occupying that seat, and the sex of each S was also recorded. 


Apparatus 


Items were typed on blank cards and photographed for mounting 
as 35 mm. slides. A Kodak Carousel slide projector (Model 580), 
with automatic timer and slide changing mechanism, was situated
in an insulated projection booth. In Study I, the slides were pro- 
jected a distance of 30 ft. onto an 8 X 6 ft. screen. The automatic 
timer provided a slide change every 20 sec. In Study II, the pro- 
jected distance was 40 ft. onto an 11 X 6 ft. screen. Here, the 
change mechanism was triggered manually in coordination with a 
stop watch, since 12 and 18 sec. intervals were not available
automatically. Informal practice and independent comparisons were
sufficient to convince the Es that the manual procedure was as
accurate as, and probably more accurate than, the automatic timer,
which seemed to contain a negative constant error of up to one
sec. The slide projector did, however, inadvertently skip an item
for one treatment condition, and therefore, that item was dropped
from the analysis of Study II. Time lost in slide changing was 
minimal and assumed to be counterbalanced by time lost in eye 
shifts and page turning in the control conditions. For all slide conditions,
the testing room was dimly illuminated at approximately
3 foot-candles at the source. Full house lighting was used for the
control groups. 


Results 


Test Characteristics 


Mean performance on the pretest in both studies indicated pre- 
dominantly chance responding in all treatment conditions. These 
data have not been reported separately here, although pretest per- 
formance in Study II was included as part of a gain score for cor- 
relational analyses to be reported below. 

Relevant test statistics for posttest performance in all condi- 
tions appear in Table 1. The mean difference obtained in Study I 
was clearly not significant. Mean performance data for Study II
provided a 2 X 2 factorial design in which unweighted means 
analysis was performed. A summary of this analysis may be found 
in Table 2. Since the interaction term was judged significant, sim- 
ple main effects were computed for presentation methods at each 
level of exposure interval. Table 3 presents these comparisons. 
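
The unweighted means analysis described above can be sketched roughly as follows. This is only an illustration built from the rounded cell means and cell sizes reported in the text and in Table 1, so the sums of squares it yields are approximations of the Table 2 entries, not a reproduction of the original computations. The variable and function names are illustrative.

```python
from statistics import harmonic_mean

# Cell sizes and posttest means for Study II, as reported in the text and Table 1.
cells = {
    ("print", 12): (48, 13.19),
    ("slide", 12): (48, 14.25),
    ("print", 18): (51, 17.80),
    ("slide", 18): (48, 16.02),
}
MS_ERROR = 15.27  # within-cell mean square from Table 2

# An unweighted means analysis replaces the unequal cell sizes by their harmonic mean.
n_h = harmonic_mean([n for n, _ in cells.values()])
grand = sum(m for _, m in cells.values()) / len(cells)


def marginal(index, level):
    """Unweighted mean of the cell means at one level of a factor."""
    vals = [m for key, (_, m) in cells.items() if key[index] == level]
    return sum(vals) / len(vals)


# Each marginal mean rests on two cells, hence the factor 2 * n_h.
ss_method = 2 * n_h * sum((marginal(0, lv) - grand) ** 2 for lv in ("print", "slide"))
ss_exposure = 2 * n_h * sum((marginal(1, lv) - grand) ** 2 for lv in (12, 18))
ss_interaction = n_h * sum(
    (m - marginal(0, key[0]) - marginal(1, key[1]) + grand) ** 2
    for key, (_, m) in cells.items()
)

for name, ss in [("method", ss_method), ("exposure", ss_exposure),
                 ("interaction", ss_interaction)]:
    # Every effect has 1 df, so the sum of squares is also the mean square.
    print(f"{name:12s} SS = {ss:7.2f}  F = {ss / MS_ERROR:5.2f}")
```

The harmonic mean of the four cell sizes comes to about 48.72. Because the cell means are rounded to two decimals, the sums of squares agree with the published ones only approximately.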

Figure 1 shows the relations among treatment means for Study 
II graphically. While it might be argued that the populations from 
which Ss were drawn differed between Studies I and II, it was be- 
lieved instructive at this stage to assume their comparability and 
plot a single time graph. Thus, means for Study I were recomputed
excluding the item which had been dropped in Study II, and
entered in Figure 1.


TABLE 1
Test Statistics for Studies I and II

                               Study I                    Study II
                              20 Second          12 Second          18 Second
                              Exposure           Exposure           Exposure
                            Print    Slide     Print    Slide     Print    Slide
Mean                        19.87    20.04     13.19    14.25     17.80    16.02
Standard Deviation           3.77     5.18      4.18     3.18      3.68     4.49
KR-20 Reliability             .51      .75       .64      .82       .49      .65
Standard Error of            2.64     2.59      2.49     2.63      2.02     2.04
  Measurement
Average Item Difficulty     252.53 86,489 48.4
Range of Item Difficulty  .10-.97  .07-.96   .00-.85  .02-.85   .10-.86  .06-.85


TABLE 2
ANOV of Study II

Source of Variation            SS      df       MS         F
Method of Presentation        6.33      1      6.33       .41
Item Exposure Time          494.02      1    494.02     32.35**
Interaction                  99.88      1     99.88      6.54*
Within Cell (Error)        2917.35    191     15.27

Note.—ñ = 48.72
*p < .05
**p < .01


TABLE 3
Analysis of Simple Effects

Source of Variation                  SS      df      MS        F
Methods for 12 Second Exposure     27.28      1    27.28     1.79
Methods for 18 Second Exposure     76.98      1    76.98     5.04*

*p < .05

Differences between the two populations would probably affect 
the elevation of these points but it is unlikely that the difference be- 
tween the points would be modified appreciably. Thus, with printed 
test administration, a roughly linear relation was apparent be- 
tween item exposure interval and mean test performance, amount- 
ing to an increase of approximately .75 mean score points per ad- 
ditional second of exposure within the range of exposures used. For 
slide-projected presentation, however, a more rapid initial decline 
in mean performance occurred between 18 and 20 sec. exposures, 
resulting in the significant difference between presentation methods 
obtained at the 18 sec. level in Table 3. Below 18 secs., mean per- 
formance in the slide condition was less affected by further reduc- 
tions in exposure time. 
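
As a rough check on the rate quoted above, the slope can be estimated from the two Study II means alone. The recomputed Study I mean used in Figure 1 is not reported, so this sketch covers only the 12 to 18 sec. range, and the names are illustrative.

```python
# Posttest means for Study II taken from Table 1, keyed by exposure time in seconds.
print_means = {12: 13.19, 18: 17.80}
slide_means = {12: 14.25, 18: 16.02}


def points_per_second(means, t0=12, t1=18):
    """Average change in mean score per additional second of exposure."""
    return (means[t1] - means[t0]) / (t1 - t0)


print(f"print: {points_per_second(print_means):.2f} points per second")
print(f"slide: {points_per_second(slide_means):.2f} points per second")
```

Over this range the print slope is close to the stated value of roughly .75, while the slide slope is considerably flatter, in line with the description of the slide condition below 18 sec.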

From Table 1 it may also be noted that slide presentation led to 
consistently higher KR-20 reliability estimates in the two condi- 
tions where KR-20 is appropriately applied. In the 12 sec. print 
group, where many Ss did not complete the test, KR-20 provides 
spurious estimates of internal consistency. The reduction in ap- 
parent homogeneity for the 12 sec. slide group probably indicates 
increased chance responding at this exposure level, even though 
all Ss completed the items. 
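
For readers who wish to retrace the reliability figures, the following sketch shows the usual KR-20 formula and the standard error of measurement derived from it. The small response matrix is hypothetical. Only the Study I standard deviations and reliabilities are taken from Table 1.

```python
from math import sqrt
from statistics import pvariance


def kr20(item_matrix):
    """KR-20 for a persons-by-items matrix of 0/1 item scores."""
    n_persons = len(item_matrix)
    n_items = len(item_matrix[0])
    total_variance = pvariance([sum(row) for row in item_matrix])
    sum_pq = 0.0
    for j in range(n_items):
        p = sum(row[j] for row in item_matrix) / n_persons  # proportion passing item j
        sum_pq += p * (1 - p)
    return (n_items / (n_items - 1)) * (1 - sum_pq / total_variance)


def standard_error_of_measurement(sd, reliability):
    """SEM = SD * sqrt(1 - reliability)."""
    return sd * sqrt(1 - reliability)


# Hypothetical four-person, three-item response matrix.
toy = [[1, 1, 1], [1, 1, 0], [1, 0, 0], [0, 0, 0]]
print(kr20(toy))

# Study I entries from Table 1: print (SD 3.77, KR-20 .51), slide (SD 5.18, KR-20 .75).
print(round(standard_error_of_measurement(3.77, 0.51), 2))
print(round(standard_error_of_measurement(5.18, 0.75), 2))
```

The two Study I results agree with the standard errors of 2.64 and 2.59 listed in Table 1.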
Figure 1. Mean posttest performance as a function of item exposure time for
print and slide test presentation conditions. 


Item Difficulty Indices 


Within exposure levels, individual item difficulty indices ob- 
tained under the two presentation conditions were compared using 
t for the significance of a difference between proportions. The .05 
level of significance was used throughout these comparisons. For 
the 20 sec. exposure level, no differences were isolated. In the 18 
sec. exposure level, the difficulty levels for items numbered 1, 6, 17, 
and 25 were significantly higher under the print conditions, while 
for item no. 22, a significant difference in the reverse direction was 
obtained. In the 12 sec. exposure condition, items numbered 4, 5, 6, 
and 7 were judged significantly easier in the printed condition 
while items numbered 30 through 38 were significantly more difficult in
the printed condition.³ For Study II, all item difficulties were transformed
to standard deviation units using Davis’s suggestion
(Davis, 1951). The resulting difficulty indices, ranging from 0 to 100,
were averaged for each third of the test (first 12 items, middle 13
items, last 12 items) under each treatment condition. A graph of
these averages is shown in Figure 2. The effect of speed upon
printed test performance is clearly apparent. Fixed item exposure
provided relatively constant item difficulty averages across the
stages of test completion. With uncontrolled item exposure, Ss
seemed to spend too much time per item in the first third of the test
and thus had too little time remaining in the last third. Similar
averages were computed for Study I, although they were not included
in Figure 2. They followed roughly the form of the 18 sec.
exposure groups with a slight but consistently higher average item
difficulty.
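
The item-by-item comparisons of difficulty indices can be illustrated with a test for the difference between two independent proportions. The sketch below uses the pooled-variance normal approximation, which may differ in detail from the t procedure used in the study, and the counts are hypothetical.

```python
from math import erf, sqrt


def two_proportion_z(correct_a, n_a, correct_b, n_b):
    """z statistic and two-sided p for the difference between two proportions."""
    p_a = correct_a / n_a
    p_b = correct_b / n_b
    pooled = (correct_a + correct_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    normal_cdf = 0.5 * (1 + erf(abs(z) / sqrt(2)))
    return z, 2 * (1 - normal_cdf)


# Hypothetical item: 36 of 48 Ss pass it in print, 24 of 48 pass it on slides.
z, p = two_proportion_z(36, 48, 24, 48)
print(f"z = {z:.2f}, two-sided p = {p:.3f}")
```

With the .05 level used throughout these comparisons, an item would be flagged whenever the two-sided p value falls below .05.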

The transformed item difficulty distributions for each Study II
treatment were also intercorrelated over the 37 items, the resulting 
matrix appearing in Table 4. The 12 sec. slide condition provided 
difficulty indices more comparable in form to those of either 18 sec. 
condition than did the 12 sec. print condition. 


TABLE 4
Intercorrelations of Item Difficulties

                  12 Second   18 Second   18 Second
                    Slide       Print       Slide
12 Sec. Print     4 .
12 Sec. Slide                    .87         .89
18 Sec. Print                                .89
18 Sec. Slide


Reference Variables 


Seven variables representing potentially relevant personal char- 
acteristics of Ss were available for correlation analyses of per- 
formance under the four treatment conditions of Study II. Three 
forms of the dependent variable were used: pretest score, posttest 
score, and residual gain score. The latter represents posttest per- 
formance with the variance predictable from the pretest partialled 


³ All individual item difficulties for both studies are presented elsewhere
(Heckman, 1965).


[Figure 2: line graph of average item difficulty (vertical axis) by thirds of the test, 1 to 3 (horizontal axis), with separate curves for the 18 sec. print, 18 sec. slide, 12 sec. print, and 12 sec. slide conditions.]
Figure 2. Average item difficulty computed for separate thirds of the posttest
for Study II treatment conditions.
out (see Dubois, 1962). The matrices of intercorrelations between 
reference variables and criteria are presented in Table 5. There is 
no apparent explanation for the high negative correlation between 
near acuity and the dependent variables. 
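
A residual gain score of the kind described above can be sketched as the part of the posttest score left over after a linear prediction from the pretest. The scores below are hypothetical and the function name is illustrative. The procedure in Dubois (1962) may differ in detail.

```python
from statistics import mean


def residual_gains(pretest, posttest):
    """Posttest scores with the part linearly predictable from the pretest removed."""
    mx, my = mean(pretest), mean(posttest)
    sxx = sum((x - mx) ** 2 for x in pretest)
    sxy = sum((x - mx) * (y - my) for x, y in zip(pretest, posttest))
    slope = sxy / sxx
    intercept = my - slope * mx
    return [y - (intercept + slope * x) for x, y in zip(pretest, posttest)]


# Hypothetical pretest and posttest scores for six Ss.
pre = [8, 10, 9, 12, 7, 11]
post = [14, 18, 15, 20, 13, 16]
gains = residual_gains(pre, post)

# By construction the residual gains have zero covariance with the pretest.
covariance = sum((x - mean(pre)) * g for x, g in zip(pre, gains))
print([round(g, 2) for g in gains], round(covariance, 10))
```

This construction explains why the correlations between pretest and residual gains in Table 5 are essentially zero.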


Attitude Toward Testing Procedures 


There were no practical differences in the attitudes of the four 
groups, but the 12 sec. slide group indicated they felt a certain 
amount of stress. 

Discussion 

More intensive investigation of the curves of Figure 1, using a 
wider range of exposure intervals with several kinds of test ma- 
terial, should provide useful information concerning the interac- 
tion of test performance and test time. The particular exposure 
TABLE 5
Intercorrelations of Dependent Variables and Reference Variables

                      Attitude Toward
Variable               Test Methods         Pretest            Posttest          Residual Gains

Vocabulary             -.05    -.03        .26     .02        .34     .31        .27     .81
                       -.03    -.03        .20     .18        .36     .28        .81     .23
Viewing Distance        .07     .00    [illegible] -.10      -.02    -.16       -.05    -.14
                        .07    -.40*       .06    -.18        .20    -.31**      .25    -.29**
Far-Point Acuity       -.05    -.13        .16     .20       -.18     .05       -.22     .00
                        .12    -.02       -.17     .09       -.20    -.12       -.15    -.12
Near-Point Acuity      -.29    -.10       -.15     .04       -.01    -.38        .02    -.39*
                       -.13     .11        .28     .10        .13     .10        .01     .08
Reading Speed           .04    -.11        .16    -.06        .29     .28        .26     .30
                       -.07     .08        .08     .81        .31     .19        .30  [illegible]
Sex                    -.18    -.06        .09     .19        .03     .13       -.01     .09
                       -.12     .21        .16  [illegible]   .39     .27        .36     .24
Attitude Toward                            .01    -.06       -.06     .13       -.03     .14
  Test Methods                            -.06    -.19       -.02     .16        .00     .21
Pretest                                                       .26     .22       -.01     .00
                                                              .41     .23        .01     .00
Posttest                                                                         .95     .98
                                                                                 .91     .97
Residual Gains

Note.—At each variable intersection there are four correlations, one for each treatment condition. The correlations
are located as follows: upper left, 12 sec. print; upper right, 12 sec. slide; bottom left, 18 sec. print;
bottom right, 18 sec. slide. The significance of a difference between correlations within exposure
levels is indicated at the right of each pair of correlations. Although the individual correlations are not marked for
significance, a correlation of .285 (df = 46) is required for a significant departure from zero.
**p < .01
times found critical in the present study are probably specific to
particular tests. However, the general shape of the curves, if repli- 
cable with different tests, suggests a linear time-performance rela- 
tionship when item exposure is left uncontrolled but a trend ap- 
proximating some step-like function with controlled item exposure. 
Coupled with the test and item statistics collected here, these data 
provide decision rules for modifying the standard time limits of 
published tests without affecting other test characteristics ad- 
versely. Given a particular test, or test battery, and a limited time 
in which to complete administration, the psychometrician might 
be able to preserve reliability and validity characteristics while co- 
ordinating and/or reducing time limits. In fact, test internal con- 
sistency seemed to be increased by item exposure control in this 
study, until very short time limits were reached. Figure 2 suggests 
in another way that, with controlled item exposure, items can be
speeded without “speeding” the total test. For the most part, those 
personal characteristics of examinees which were thought most 
likely to confound primary measurement purposes, seemed not to 
effect appreciable differences in test performance among the various 
presentation conditions investigated here. Their influence does not 
account for the higher internal consistency estimates obtained 
under relatively unspeeded slide-presentation conditions. The dis- 
tance at which slide-projected items were viewed, however, was 
found to be correlated negatively with both test performance and 
attitude toward the slide presentation method. While larger pro- 
jected images or smaller administration rooms might overcome 
this effect, further work with tests on slides will need to consider 
either statistical or experimental control of the viewing distance 
variable. 

Aside from practical implications, it would appear that the as- 
sumptions underlying several conventional item and test analysis 
procedures are met more adequately by items administered under 
controlled exposure conditions. The temporal and sequential char- 
acteristics of items are frequently ignored in research on, or using, 
test measurement. Yet, information on the effects of these char- 
acteristics should be important for theoretical as well as practical 
purposes. There should be room in test theory for the notion that
every test has a speed limit at which its measurement function 
breaks down and that this limit may be altered by the medium 
through which test stimuli are communicated. 


Summary 


Two studies were conducted in which slide-projected presenta- 
tion of achievement test items was compared with conventional, 
printed presentation of the same items in terms of item and test 
performance characteristics. One study used a 20 sec. item ex- 
posure interval for slide projection and its equivalent in total test 
time for the printed administration, while the other study used 
both 12 sec. and 18 sec. exposure intervals and their printed, total 
time limit equivalents. The combined results suggest that, com- 
pared with traditional modes of test presentation, controlling item 
exposure by slide projection (1) provides more internally consistent
measurement, except at very short exposure times, (2) permits
the use of shorter, more speeded, total time limits in tests,
while allowing all examinees to complete every item, and (3) produces
some differential correlation between test performance and
viewing distance and near-point visual acuity but not between
test performance and far-point visual acuity, reading speed, or
verbal comprehension. A general relationship between test performance
and test time, differing between controlled and uncontrolled
item exposure conditions, was hypothesized. Practical implications
of the findings were discussed.
A NOTE ON THE USE OF PAPER-PENCIL ITEMS
TO PROBE COGNITIVE AND AFFECTIVE PROCESSES 


JOSEPH ALLMAN AND MILTON ROKEACH¹
Michigan State University 


One of the more nagging conceptual problems in attitudinal research
has been the nature of the relationship between cognitive
and affective processes. This conceptual ambiguity generates operational
difficulties in the construction of questionnaire items
which are predicated on varying and often contradictory assumptions
concerning this relationship. This research has been carried
out in an effort to clarify the degree to which differential responses 
can be elicited by varying only the cognitive and affective orienta- 
tions of the items in a paper-pencil questionnaire. 

The thinking which has guided this effort is that of Rokeach 
(1960), who argues that: 


In everyday discourse we often precede what we are about to
say with the phrase “I think . . . ,” “I believe . . . ,” or “I
feel. . . .” We pause to wonder whether such phrases refer to
underlying states or processes which are really distinguishable 
from each other. . . . The fact that these phrases are often 
(although not always) interchangeable suggests the assumption 
that every emotion has its cognitive counterpart, and every 
cognition its emotional counterpart. (p. 8) 


A general hypothesis might be that if cognitively oriented 
phrases such as “I believe . . .” and “I think . . .” or such affectively
oriented phrases as “I feel . . .” do refer to some underlying
and distinguishable processes then it should be possible to observe 


1 This work was facilitated by a grant from the National Science Foundation 
to the junior author. 
variations in responses resulting from such differences in the wording
of the items. More specifically, assuming that these processes
are functionally equivalent rather than independent, the following 
hypotheses seem reasonable: 


1. Responses to Likert-type items will not vary as a result of 
manipulating the cognitive and affective orientations. 

2. The unidimensionality of a particular scale will not be af- 
fected as a result of manipulating the cognitive and affective orien- 
tation of scale items. 

3. Similar patterns of correlations will be found when a set of 
cognitively oriented items and a set of affectively oriented items 
are compared with an outside criterion. 


Research Design 


A test-retest format was used in the investigation. The instruments
included a paper-pencil questionnaire,² a four item Guttman
scale of altruism with a known coefficient of reproducibility, and a 
set of aspiration-to-activity items. The student opinion question- 
naire consisted of sixteen items to which the subjects were to re- 
spond on a six-point, agree-disagree Likert scale. There were four 
conditions to be examined using these same sixteen items on four
different occasions with the same subjects. Condition 1: four items
prefaced with the phrase “I think that . . .”; Condition 2: four
items prefaced with the phrase “I believe that . . .”; Condition 3:
four items prefaced with the phrase “I feel that . . .”; and Condition 4:
four items without preface, to serve as control. The four
types of prefaces to the four sets of items were rotated in each of 
the four tests in the series, so that after the four tests all sixteen 
items had been presented under all four conditions. This proce- 
dure allowed for the use of two types of scores for each of the sub- 
jects. The “test score” is the score each individual obtained on 
each of the four testing occasions (each testing occasion including 
all four conditions). After the four tests had been given to each of 


² The opinion questionnaire probed such topics as communist speakers on
campus, racial problems, activity in campus organizations, preferences as to 
amount and kind of education necessary, and problems concerned with course 
organization, grading, testing, etc. 

3 An individual's score is the sum of the responses to the six-point Likert 
type scale for the sixteen items. A minimum score would be 16 x 1 = 16, 
and the maximum score would be 16 x 6 = 96. 
the subjects it was possible to calculate four “condition scores.”
This was the score that each subject received on the sixteen items 
under the same condition. Thus, the four test scores represented 
the way that the individual responded to the sixteen items on four 
separate occasions when the items were prefaced with the different 
phrases in the same questionnaire. The four condition scores rep- 
resented the way the individual responded to the sixteen items 
(four items in each of the four separate testing sessions) when the 
items had all been prefaced with “I believe that ... ,” “I think 
that...,” “I feel that... ,” and no preface. 
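
The distinction between test scores and condition scores can be made concrete with a small sketch. The text does not give the exact rotation scheme, so the Latin-square style rotation below, in which each block of four items moves to the next preface on each successive test, is an assumption made for illustration only. All names are illustrative.

```python
N_ITEMS, N_TESTS = 16, 4
PREFACES = ["I think that", "I believe that", "I feel that", ""]  # "" is the control


def condition_index(item, test):
    """Condition (0-3) under which an item (0-15) appears on a given test (0-3)."""
    block = item // 4          # items 0-3, 4-7, 8-11, 12-15 form four blocks
    return (block + test) % 4  # assumed rotation: shift one preface per test


def score(responses):
    """responses[test][item] holds a Likert response from 1 to 6."""
    test_scores = [sum(responses[t]) for t in range(N_TESTS)]
    condition_scores = [0] * len(PREFACES)
    for t in range(N_TESTS):
        for i in range(N_ITEMS):
            condition_scores[condition_index(i, t)] += responses[t][i]
    return test_scores, condition_scores


# A subject who marks 3 on every item at every session (hypothetical data).
flat = [[3] * N_ITEMS for _ in range(N_TESTS)]
print(score(flat))
```

Under any such rotation both kinds of score are sums over sixteen responses, so each ranges from 16 to 96, and the four test scores and four condition scores of a subject have the same grand total.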

The second instrument used in the research was a four item 
Guttman scale designed to probe the altruistic beliefs of the respondents.⁴
In this research the focus was upon the previously determined
unidimensional nature of the four items rather than on
its validity. The four items of this scale were varied once more in 
terms of the associated prefaces (I think that, I believe that, I feel 
that) to determine their possible effect upon the unidimensional 
characteristic of the scale. Scalogram analysis was applied to the
sets of responses from each testing occasion and to the sets of responses
under three of the four experimental conditions (the control
condition was not included).

The third part of the research was based upon a noted operational
similarity between the concepts of "level of aspiration" and
"attitude" (Haller and Miller, 1963). A set of sixteen "aspiration to
activity" items was developed by attempting to restate the
sixteen opinion statements employed in the present study in terms
of the action the subject was willing to take in line with his expressed
opinion.⁵ This set of items was designed with six possible


⁴ The four items in this “altruism” scale were: (1) I’ve got problems enough
of my own without having to worry about other people’s problems too; (2)
Things go best when everybody minds his own business; (3) People should
take care of their own problems; (4) Nowadays, people depend on each other
too much. This scale was used in unpublished research on student attitudes
toward Negroes carried out at Michigan State University by a graduate
seminar in sociology. It was found to scale by the Guttman criteria
(C.R. = .936, N = 55), but there is little evidence to support the validity
of the scale.

⁵ For example, an opinion statement from the original sixteen items was:
"Communists should be allowed to speak on campus." The aspiration-to-activity
item which corresponded to this opinion statement was: "If I could
I would like to have: (1) a communist teach a course on campus; (2) a communist
lecture in a course sometimes; (3) a communist make speeches on
campus; (4) instructors teach about communism; (5) books available about
communism; (6) communist influence kept off campus."
responses to correspond to the six point Likert scale used with the
opinion items. It was expected that there would be a positive correlation
between the opinion item scores and the aspiration-to-activity
scores. However, the central hypothesis in this part of the
investigation was that correlations between conditions 1, 2, 3, and
4 scores and the aspiration-to-activity score would not be significantly
different from one another. The aspiration-to-activity items
were administered as a part of the last test in the series.

Forty-three subjects were tested on four different occasions. The
first three tests were given a week apart, and the fourth test was
given two weeks later.


Analysis of Findings 


Hypothesis 1. The ranges and means for both the four test and
condition scores were computed, and a comparison of these statistics,
shown in Table 1, demonstrates no differences in range or
means of the subjects’ responses to the items. The range of scores
seems to be essentially the same for tests and conditions, and none
of the means shown in Table 1 differs significantly from any of the
other means.


TABLE 1
Ranges and Means of Test and Condition Scores (N = 43)

             RANGES                  MEANS
        Test     Condition      Test     Condition

1.     40-61      40-62         50.0       49.9
2.     38-59      40-61         49.2       49.0
3.     40-61      40-61         49.7       49.5
4.     43-60      40-59         50.2       50.5


To further test Hypothesis 1 two correlation matrices (product-moment)
were constructed: a matrix for the four test scores and a
second matrix of the four condition scores. This enabled us to compare
the correlations among the condition scores with those among
the test scores.

A comparison of the correlations of both test and condition 
scores (Tables 2 and 3) indicates that the results are essentially 
identical. Manipulating the cognitive and affective orientations 
does not seem to lead to discriminably different results. These 
findings shown in Tables 1, 2, and 3 seem to support Hypothesis 1.


TABLE 2
Correlation Matrix of Test Scores

            Test Scores
        1      2      3      4
1             .60    .64    .62
2                    .68    .61
3                           .65


TABLE 3
Correlation Matrix of Condition Scores

          Condition Scores
        1      2      3      4
1             .58    .69    .75
2                    .67    .58
3                           .71


Hypothesis 2. The scalogram analysis of the responses to the four-item
Guttman scale showed coefficients of reproducibility of .919,
.913, .907, and .907, respectively, for the four tests in the series.
The coefficients for the four items under the three experimental
conditions (I think that, I believe that, and I feel that) were .884,
.878, and .873, respectively. The uniformity of the coefficients of
reproducibility for each of the tests and under each of the three
conditions indicates that variation of the phrases did not affect the
unidimensional nature of the four-item scale. The relatively lower
coefficients obtained from the experimental conditions can be explained
by the method by which the condition scores were derived
(i.e., the condition score is the summation of scores from four
items given in each of the four separate testing occasions while the
test scores were derived from the same testing occasion). The coefficients
for the tests were .91 and .92, and the coefficients for the
three conditions were .87 or .88. This substantiates the second hypothesis
that the unidimensionality of a particular scale will not be
affected as a result of manipulating the cognitive and affective orientations
of scale items.
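
A coefficient of reproducibility of the kind reported above can be sketched as one minus the proportion of responses that depart from a perfect scale pattern. Several conventions exist for counting such departures. The sketch below assumes dichotomized (agree or disagree) responses and counts every mismatch with the ideal pattern implied by a respondent's total score, which is not necessarily the convention used in the study. The response patterns are hypothetical.

```python
def coefficient_of_reproducibility(patterns):
    """CR = 1 - errors / (respondents * items) for 0/1 response patterns."""
    n_respondents = len(patterns)
    n_items = len(patterns[0])
    popularity = [sum(p[j] for p in patterns) for j in range(n_items)]
    # Items ordered from most to least frequently endorsed.
    order = sorted(range(n_items), key=lambda j: -popularity[j])
    errors = 0
    for p in patterns:
        total = sum(p)
        ideal = [0] * n_items
        for j in order[:total]:  # a perfect scale type endorses the easiest items first
            ideal[j] = 1
        errors += sum(a != b for a, b in zip(p, ideal))
    return 1 - errors / (n_respondents * n_items)


# Five perfect scale types plus one deviant respondent (hypothetical data).
responses = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 1],
]
print(round(coefficient_of_reproducibility(responses), 3))
```

A set of perfectly cumulative patterns gives a coefficient of 1.0. Each deviant pattern lowers it in proportion to the number of misplaced responses.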


Hypothesis 3. This portion of the analysis was based on thirty-six
cases since seven subjects failed to complete the aspiration-to-activity
items. The correlations (product-moment) between the
four condition scores and the aspiration-to-activity score were
.203, .204, .205, and .011, respectively. The uniformity of the correlations
between the first three condition scores and the aspiration-to-activity
score supports the hypothesis, although the correlations
between opinion statements and degree of aspiration-to-activity
are moderate at best. The lack of correlation between the fourth
condition (items without prefaces) and the aspiration-to-activity
score was unexpected, especially in view of the fact that the four
condition scores were found to intercorrelate equally well (see
Table 3). The difference in results, however, does not seem to stem
from the cognitive or affective orientations of the items, but from
the presence or absence of a preface of some kind. Judging from
the correlations with the external criterion (aspiration-to-activity
score) above, it would seem that the addition of some preface such
as “I think that . . . ,” “I believe that . . . ,” or “I feel that . . .” may
tend to elicit a more valid response to items. That is, items without
this type of preface may elicit attempts on the part of the subjects
to judge the factual validity of the statements rather than to
respond in terms of their personal opinion. This suggestion, while
not altogether consistent with the uniform results shown in Table
3, points up the need to study further the effect of having items
prefaced or not prefaced by such phrases as those used in this research.
Generally, the analysis seems to substantiate Hypothesis
3 that similar patterns of correlations will be found when a set of
cognitively oriented and affectively oriented items are compared
with an outside criterion.
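
To put the size of these correlations in perspective, each can be converted to a t statistic with N - 2 degrees of freedom. The sketch below uses the thirty-six cases and the four correlations reported above. The critical value of 2.032 (two-tailed, .05 level, df = 34) is taken from a standard t table and is not given in the source.

```python
from math import sqrt


def t_for_r(r, n):
    """t statistic for testing a product-moment correlation against zero."""
    return r * sqrt(n - 2) / sqrt(1 - r * r)


N = 36
T_CRIT_05 = 2.032  # two-tailed .05 critical t for df = 34 (standard table value)

# Correlations of the four condition scores with the aspiration-to-activity score.
correlations = {
    "I think that": 0.203,
    "I believe that": 0.204,
    "I feel that": 0.205,
    "no preface": 0.011,
}
t_values = {label: t_for_r(r, N) for label, r in correlations.items()}

for label, t in t_values.items():
    verdict = "significant" if abs(t) > T_CRIT_05 else "not significant"
    print(f"{label:15s} t = {t:5.2f}  ({verdict} at .05)")
```

With thirty-six cases none of the four correlations reaches the .05 level, which fits the cautious wording used for them above.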


Summary 


Such phrases as “I think that . . . ,” “I believe that . . . ,”
or “I feel that . . .” do not seem to elicit differential responses
when associated with opinion statements in paper-pencil questionnaires.
A comparison of means, ranges, and correlations of both
the test and condition scores indicates no effect from such variation.
Variation in the cognitive and affective orientations of items on a 
four-item Guttman scale did not seem to affect the unidimensional 
nature of the scale. Finally, similar patterns of correlation were 
found when cognitive and affective items were compared with an 
outside criterion. 

If the use of such phrases in opinion tests does not elicit differential
responses then there is some evidence to suggest that the
underlying processes which these phrases represent may also be
interchangeable rather than functionally independent.
A SIMPLE METHOD FOR OBTAINING
SOCIAL DESIRABILITY VALUES 


WALTER B. SIMON¹
Northampton State Hospital 
and 
University of Massachusetts 


The social desirability (S-D) dimension has attracted a good
deal of attention in recent years. The relevant literature has been 
reviewed by Edwards (1957a, 1964). 

The typical method for obtaining S-D values has been to have 
subjects (Ss) evaluate each statement on an interval scale; some 
measure of central tendency is then computed for the scores obtained
by all Ss for each statement. Thus, Edwards (1957a) uses a
nine point scale and Cowen (1961) a seven point scale. This pro- 
cedure is somewhat tedious and time consuming. A simpler one 
would be merely to require Ss to designate each statement as 
“favorable” or “unfavorable.” The relative S-D of a statement in 
the list is based on the number of favorable endorsements it receives.
The present note reports on (1) the test-retest reliabilities
of this method of obtaining S-D values and (2) the correlation
between the social desirability values obtained with the present
method and those obtained with the usual method by Edwards
(1957b) and by Kogan and Fordyce (1962).²

Ss were 100 attendants, 60 women and 40 men, at a state hospital. 
The mean age of the women was 41.6, that of the men 39.7. Ss were 


¹ Joel Liebowitz administered the tests and Kathleen D. Fish evaluated the
data. Thanks are also due to Harry Goodman, Superintendent of Northamp- 
ton State Hospital, and to Florence Eaton, Director of Nurses, for facilitating 
the various phases of the study. 

2 These authors kindly made their S-D data available. 
tested in small groups of fewer than 10 Ss each. The same examiner
conducted all the testing sessions. 

The instrument used was the Interpersonal Check List (ICL) 
developed by LaForge and Suczek (1955). The 128 items of the 
list were divided into two lists of 64 items (lists A and B) equated 
on the basis of the S-D values of the Edwards (1957b) sample. 
Each list was presented on a single sheet. 

The S-D ratings were obtained as part of a larger project. They 
followed several administrations of ICL under other instructions, 
given in the same session. Another set of S-D ratings was obtained 
approximately two weeks later, again following other administra- 
tions of ICL. Half of the Ss were given list A first, the other half 
were given list B first. Ss were instructed to designate each state- 
ment as either favorable or unfavorable and to do the judging 
rapidly. 

Test-retest reliability, based on the data of 99 Ss³ and estimated
by tetrachoric correlations (Guilford, 1956, p. 307), was .89.
This is not quite as high as the reliabilities reported by Edwards 
and Walsh (1963) for another set of personality items, using the 
usual method of obtaining S-D values. 

Comparisons with the Edwards and the Kogan S-D values were
based on the proportion of F/(F + U) judgments for each of 126 items
of ICL.⁴ Since proportions for the test and retest were available, the
mean of the two proportions was used as the score correlated. The
product moment correlation with the values obtained by Edwards'
(1957b) sample of college students was .94; the correlation with the
values obtained by Kogan and Fordyce's (1962) patient sample was
.95. This is within the same range as the intercorrelations obtained
with the usual method among different groups of Ss (Edwards,
1964; Cowen and Budin, 1964).
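
The scoring just described can be sketched in a few lines. All counts and reference scale values below are hypothetical and serve only to show the computation: the proportion of favorable judgments per item, averaged over test and retest, and then correlated with S-D values obtained by a rating-scale method.

```python
from math import sqrt


def favorable_proportion(favorable, unfavorable):
    """Proportion F / (F + U) of favorable judgments for one item."""
    return favorable / (favorable + unfavorable)


def pearson_r(xs, ys):
    """Product-moment correlation between two equally long lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


# Hypothetical (favorable, unfavorable) counts for five items at test and retest.
test_counts = [(90, 9), (70, 29), (50, 49), (20, 79), (5, 94)]
retest_counts = [(88, 11), (72, 27), (45, 54), (25, 74), (8, 91)]

# Item score: mean of the test and retest proportions.
sd_values = [
    (favorable_proportion(*first) + favorable_proportion(*second)) / 2
    for first, second in zip(test_counts, retest_counts)
]

# Hypothetical S-D scale values for the same items from a nine point rating scale.
reference_values = [7.8, 6.5, 5.1, 3.0, 1.9]
r = pearson_r(sd_values, reference_values)
print(round(r, 3))
```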
VOCATIONAL SELF CONCEPTS OF BOYS CHOOSING
SCIENCE AND NON-SCIENCE CAREERS¹


ROBERT P. O'HARA 
Harvard University 


The importance of the self concept to career development was
first noted by Super (1951). In a later work he formulated the following
hypothesis: "Self concepts begin to form prior to adolescence,
become clearer in adolescence, and are translated into occupational
terms in adolescence" (Super, 1957).

O'Hara and Tiedeman (1959) researched the clarification of vo- 
cational self concepts during the four grades of secondary school. 
They found progressive clarification of interests, work values, gen- 
eral values and aptitudes over the four grade levels. In an explora- 
tory study, O'Hara (1962a) found clarification of interests ex- 
isting at a lower level in grades seven and eight than the level 
found for grade nine in the earlier study. 

A third study showed that these same vocational self concepts 
were related to the vocational developmental task of achievement 
during the four grades of secondary school (O'Hara, 1962b). The 
most striking relationships were found in the areas of mathematics 
and science. When self concept variables were added to test scores, 
the multiple correlations with achievement were increased in eight 
out of eleven subject matter areas in grades nine and twelve. 

From these studies, which shed an indirect light on career development,
it seemed feasible to proceed to a study of the translation of
the self concepts into occupational terms. Our general hypothesis
was that there would be significant relationships between specific
vocational self concepts in the a priori relevant areas and the
general career orientations of Science (Sc) or Non Science (N Sc).


¹ The research reported in this paper was supported in part by the College
Entrance Examination Board.


Design of Research


In 1958 Tiedeman, O'Hara, and Matthews wrote that future investigations
of career development should concentrate, “(1) upon
formalizing the rules of combination of data and establishing suc- 
cess rates for prediction, using systems of data already known to be 
related to position choice; (2) upon investigating presently de- 
signated systems of data in interaction with each other; and (3) 
upon introducing new systems of data into our consideration" 
(Tiedeman et al., 1958). 

The new system of data not hitherto used in the study of scien- 
tific careers is the vocational self concept. This report is chiefly con- 
cerned with the introduction of this system of data into the re- 
search on scientific careers. 

The writer thinks that light will be shed on career development 
if the variables are combined without regard to their separate con- 
struct entities. It would appear that the person himself operates in 
this fashion in his day to day functioning. Under the direction of a 
counselor, and with the help of tests, it is possible to organize the 
data into systems for the sake of greater understanding. But this 
organization may not be the appropriate one. Hence we shall ana- 
lyze the interaction of all the variables in all the systems in rela- 
tion to the Science-Non Science career choice. 

It should be noted that we have assigned a value of 1 to the 
choice of Science career and a value of 0 to the choice of Non Sci- 
ence career. When these data are shown to be significantly related 
to another set of test scores, a positive correlation can be interpreted
to mean that the Science group scored high while the Non
Science group scored low. A significant negative correlation is to be 
interpreted conversely. 
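
The sign convention just described can be illustrated with a small sketch. The scores below are hypothetical and are not taken from the study; the only point is that, with Science coded 1 and Non Science coded 0, an ordinary product moment correlation between the code and a set of scores (a point biserial correlation) is positive when the Science group scores higher and negative when it scores lower.

```python
from math import sqrt


def pearson_r(x, y):
    """Product moment correlation of two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# Hypothetical data: 1 = Science choice, 0 = Non Science choice.
choice = [1, 1, 1, 0, 0, 0]
scores = [62, 58, 55, 49, 51, 44]  # Science group scores higher here

r_pb = pearson_r(choice, scores)                       # positive
r_pb_reversed = pearson_r(choice, [-s for s in scores])  # negative
```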

In our case the criterion as described is dichotomous. The corre- 
lation, therefore, is a multiple point biserial correlation. The re- 
sults of this technique are proportional to the results of a two 
group discriminant analysis (Tiedeman, Rulon, and Bryan, 1951). 
The correlation coefficients and the weights associated with them 
are interpreted as for the ordinary multiple correlation.


Systems of Data


Ability—From the Differential Aptitude Test (DAT) battery 
(Bennett, Seashore, and Wesman, 1947), we chose five scales for
our research. These scales were Verbal Reasoning, Numerical Abil- 
ity, Mechanical Reasoning, Space Relations, and Abstract Reason- 
ing. 

Interest—To measure interests we chose the Kuder Preference 
Record—Vocational (KPR) Form CH (Kuder, 1948).

Values—(1) The relationship of values to occupational choice 
has been shown in the many studies reported in the manual for the 
Allport-Vernon-Lindzey Study of Values (1951). At the time of our 
original study, there was not available a high school version of this 
test. The writer, with the permission of G. W. Allport, revised the 
1951 edition (RAVL) for use in this research project. (2) The orientation
of the Study of Values is generic. It analyzes and summarizes
the students’ general value orientations. At Columbia, Super
and the staff of the Career Pattern Study had just developed an
instrument called the Work Values Inventory (WVI) (Super, 
1955). Permission was given for the use of an early form of this 
instrument to study the existence of values presumed to be more 
specifically related to the world of work. 


Self Concept 


The self concept in these studies has been given a relatively re- 
stricted definition. The writer holds the theoretical position that a 
person does not have one single concept of himself, but rather a 
series of concepts of himself which are relevant to the many areas 
of his living. A major aspect of these concepts of self is their evalu- 
ative character. It is with this aspect of self concept that we are 
concerned. We further delimit the self concept notion in this study 
to the systems already described. Since these are presumed to be 
related to vocational choice, we have called them vocational self 
concepts. Vocational self concepts were measured by asking the 
boys to rate themselves on a nine point scale. Definitions for these 
scales were drawn either from the authors’ manuals or from the 
tests themselves. The following is an example of the self concept
scale corresponding to the Outdoor scale of the Kuder Preference
Record. The students were told that the scores would not be used
for grading but only for counseling. The self rating scale was administered
prior to the administration of the “tests.”

Outdoor interest means that you prefer work that keeps you
outside most of the time and usually deals with animals and growing
things.

Scale positions, from left to right:

   1. My interest ranks with the lowest group in this field
   2. Very much less interest than most people
   3. Much less interest than most people
   4. Less interest than most people
   5. Same amount of interest as most people
   6. Little more interest than most people
   7. Much more interest than most people
   8. Very much more interest than most people
   9. My interest ranks with the highest group in this field


The Criterion 


The testing took place in March. In May the boys were asked to 
list the careers they were considering or had decided upon at that 
time. Assignment to the Science or Non Science Pool was made 
by the writer and two assistants. 




Subjects 


The subjects of this study were boys in a private college pre- 
paratory school, which draws the student body from the Metro- 
politan Boston area. There were 979 boys in the sample; 152 Sen- 
iors, 254 Juniors, 257 Sophomores, and 316 Freshmen. The four 
grades are homogeneous by sex, intelligence, and religion by virtue 
of administrative policy. The data also revealed that the boys in
the four grades had similar distributions of Verbal Ability, Nu- 
merical Ability, and Social Class. 


Analysis of Data 
Systems of Data 


It is not possible to include here the detailed analysis of the 
four systems of data.² Interests, work values, general values, and
aptitudes were each shown to be in varying degrees related to the 
choice of Science or Non Science. Taken as single systems, the 
highest correlations were produced by the interest system. The 
other three systems appear to produce correlations of approximately
the same order, except that in grades nine and ten the aptitude
system correlations are somewhat lower. It would not be expected
that work values, general values, and aptitudes would differ
greatly in their rates of prediction. Interests, however, taken as a
single system, might be expected to surpass the other three systems
in prediction of the choice of a Science or Non Science career.


² A limited supply of copies of the complete report is available from the
College Entrance Examination Board.


Multiple Correlations from Test Scores and Self Ratings 


Table 1 presents, by grade level, the multiple correlations de- 
rived from a seventy second order regression equation (36 test 
scores and 36 Self Ratings). On both sides of this Table (test 
scores and test scores plus Self Ratings) the correlations with the 
criterion appear to maintain a plateau through grades nine, ten, and 
eleven. In grade twelve there are what appear to be significant in- 
creases in the multiple correlation derived from test scores, and 
from test scores plus Self Ratings. The test scores plus Self Rating 
correlation is .6947, accounting for 48 per cent of the variance in the
choice of a Science or a Non Science career.


TABLE 1
Multiple Correlations of All 72 Variables with the Criterion (Sc-Non Sc)
for Four Grade Levels

                        Significant                              Significant
                        Beta           Scales plus               Beta
Scales                  Weights        Self Ratings              Weights

Grade 9 (N = 316)                      Grade 9 (N = 316)
KPR Scientific            .3370        KPR Scientific              .3663
KPR Mechanical            .2572        KPR Mechanical              .2426
KPR Persuasive            .2441        KPR Persuasive              .2315
KPR Computational         .2280        KPR Artistic                .1641
RAVL Political           −.1365        RAVL Political             −.1511
WVI Way of Life          −.0975        KPR Computational           .1486
                                       WVI SR-Creative            −.1375
                                       DAT SR-Numerical            .1286
                                       WVI SR-Material             .1198
                                       WVI Way of Life            −.1144
Multiple R                .5238        Multiple R                  .5565
F Ratio                 16.6407        F Ratio                   13.6800
Probability              <.01          Probability                <.01

Grade 10 (N = 257)                     Grade 10 (N = 257)
KPR Scientific            .2561        KPR SR-Scientific           .2652
DAT Spatial               .1768        KPR Mechanical              .1891
RAVL Aesthetic           −.1670        DAT SR-Numerical            .1826
WVI Social Welfare       −.1527        RAVL Aesthetic             −.1547
KPR Mechanical            .1346        RAVL Theoretical            .1465
                                       KPR Outdoor                 .1406
                                       WVI SR-Theoretical         −.1351
                                       DAT Spatial                 .1172
Multiple R                .5216        Multiple R                  .5755
F Ratio                 18.7654        F Ratio                   15.3501
Probability              <.01          Probability                <.01

Grade 11 (N = 254)                     Grade 11 (N = 254)
KPR Scientific            .9151        KPR SR-Computational        .9277
DAT Verbal               −.2634        KPR Scientific              .2152
DAT Numerical             .1900        RAVL SR-Economic            .2120
DAT Spatial               .1794        KPR Literary               −.1959
KPR Literary             −.1433        KPR SR-Clerical            −.1761
WVI Security              .131         RAVL Social                 .1737
                                       RAVL SR-Religious          −.1525
                                       WVI SR-Work Conditions      .1406
                                       WVI Work Conditions         .1092
Multiple R                .5020        Multiple R                  .5749
F Ratio                 13.8666        F Ratio                   13.3875
Probability              <.01          Probability                <.01

Grade 12 (N = 152)                     Grade 12 (N = 152)
KPR Scientific            .3894        KPR Scientific              .3786
KPR Artistic              .1855        WVI Material                .2201
WVI Aesthetic            −.1800        DAT SR-Verbal               .2110
RAVL Political           −.1745        KPR SR-Persuasive          −.2026
KPR Computational         .1703        WVI Mastery                 .1726
                                       WVI Work Conditions        −.1664
                                       RAVL Political             −.1639
                                       KPR Artistic                .1445
                                       KPR SR-Mechanical           .1439
                                       WVI SR-Association          .1430
                                       DAT SR-Numerical            .1405
Multiple R                .6464        Multiple R                  .6947
F Ratio                 14.7674        F Ratio                   11.8749
Probability              <.01          Probability                <.01

It is clear from Table 1 that when all test score systems were 
combined, each was found to contribute at some grade level some 
important variable to the making of the multiple correlation. But 
when all test scores and all Self Ratings were combined, the apti- 
tude test score system was found not to contribute a single variable. 

There are a total of thirty variables related to choice of Science-
Non Science over the four grade levels. Only six of these occur in
more than one grade. The KPR Scientific Scale occurs in grades
nine, eleven, and twelve; the KPR Mechanical in grades nine and 
twelve; the SR Numerical in grades nine, ten, and twelve; and the 
WVI Work Conditions in grades eleven and twelve. The pattern of 
variables differs at every grade level. 

This is an unexpected finding. We can only speculate about its 
import. Our speculations are influenced by our conception of ca- 
reers in terms of development. We think that the varying patterns 
for the grade levels could result from the presence in the Science 
pool of boys who will eventually drop out. 


Importance of Self Rating Variables 


It was not found possible to test for the significance of the dif- 
ference between correlations produced by diverse sets of variables. 
The unique contribution of the Self Ratings to the multiple correla- 
tion was found by first calculating the R for the test scores, and 
then the R for test scores plus Self Ratings. 

Table 2 shows clearly that the inductively chosen Self Ratings
are contributing significantly to the multiple correlation.


TABLE 2
Significance of the Difference between Multiple Correlations for Tests and Multiple
Correlations of Test Scores plus Self Ratings for All Significant Variables

Dependent Variable: Science-Non Science

                           Test Scores
           Test Scores     plus Self Ratings
Grade          R                R               F         P
9            .5238            .5565           4.15      <.001
10           .4803            .5755           4.79      <.001
11           .4186            .5749           6.28      <.001
12           .5987            .6947           6.73      <.001
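
The report does not state which formula produced the F values in Table 2. One common way of testing whether an added set of predictors raises a multiple correlation is the F ratio for the increment in R squared, sketched below. The function name and its arguments are illustrative assumptions, and because the number of predictors entering each equation is not given here, the sketch should not be expected to reproduce the tabled values.

```python
def f_for_added_predictors(r_reduced, r_full, n, k_reduced, k_full):
    """F ratio for the gain in R squared when predictors are added.

    r_reduced, r_full : multiple correlations without and with the added set
    n                 : number of subjects
    k_reduced, k_full : number of predictors in each equation
    Degrees of freedom are (k_full - k_reduced) and (n - k_full - 1).
    """
    gain = (r_full ** 2 - r_reduced ** 2) / (k_full - k_reduced)
    error = (1.0 - r_full ** 2) / (n - k_full - 1)
    return gain / error


# Hypothetical illustration: two predictors added to a two-predictor equation.
f_value = f_for_added_predictors(0.50, 0.60, n=100, k_reduced=2, k_full=4)
```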


Prediction of Membership in the Science and Non Science Groups 


To determine the concurrent effectiveness of the four prediction 
equations for the four grade levels, discriminant scores were cal- 
culated for each student. The variables used in the equations were 
those listed on the right hand side of Table 1 as significant fac- 
tors. Table 3 contains the hit and miss rate in percentages. 
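
The percentages in Table 3 are simple proportions of hits and misses within each group. A minimal sketch, using the grade twelve counts from Table 3 and a hypothetical helper name, shows the arithmetic:

```python
def hit_and_miss_percent(hits, misses):
    """Return (hit %, miss %) rounded to whole per cents."""
    total = hits + misses
    return round(100 * hits / total), round(100 * misses / total)


# Grade twelve counts as given in Table 3: (hits, misses).
science = (61, 13)
non_science = (67, 11)
totals = (science[0] + non_science[0], science[1] + non_science[1])

science_pct = hit_and_miss_percent(*science)
non_science_pct = hit_and_miss_percent(*non_science)
totals_pct = hit_and_miss_percent(*totals)
```

The same helper applied to 69 hits and 5 misses gives 93 and 7 per cent.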

An analysis of the misses will help in the interpretation of the 
summary data in Table 3. Eleven boys in grade twelve made 
choices categorized as scientific, but were predicted as Non Science.

TABLE 3
Rates of Success in Prediction of Membership in Science and Non Science Groups

                  Science                  Non Science                  Totals
              Hit        Miss           Hit        Miss            Hit        Miss
Grades       N    %     N    %         N    %     N    %          N    %     N    %
12          61   82    13   18        67   86    11   14        128   84    24   16
11         108   83    22   17        89   72    35   28        197   78    57   22
10         106   78    30   22        80   66    41   34        186   72    71   28
9          128   78    36   22       113   74    39   26        241   76    75   24


These choices were as follows: four doctors, four accountants, an 
aeronautical engineer, an electrician, and a physicist. The last 
three are probably real misses.

Although the assignment of accountants to the Science pool was 
based largely on their need for a knowledge of mathematics, the 
appearance of these four misses seemed to cast doubt on such an 
assignment. Since, however, there were six accountants in the Sci- 
ence hit pool, the correctness of the assignment is debatable. 

There were ten doctors in the Science hit pool. The four doctors 
missed in the Non Science pool are most probably real misses. 

Thirteen boys in grade twelve made choices categorized as Non 
Scientific, but were predicted as Science. These choices were as follows:
five seminarians, three military career men, a pilot, a physical
education teacher, a public relations man, a lawyer, and a
business man.

Since there were eleven hits scored by assigning seminarians to 
the Non Science pool, the five misses would appear to be real. The 
issue, however, is clouded by the fact that if these young men enter 
religious orders, it is possible for them to pursue careers in science. 
In such cases, these should be classified as hits. If they opt for the 
secular priesthood, a science career is a very remote possibility. 

Five military career men assigned to the Non Science pool were 
hits. Since there were also these three military career men who 
were misses, the assignment to the Non Science pool is debatable. 
Probably the same scientific career option open to the religious
order seminarians is open to the military career men, and thus
they may also be considered as hits. 

If the hypothesis about the five religious order seminarians and 
the three military career men is valid (i.e., these could follow sci- 
entific careers), the data in Table 3 would be modified as follows: 
sixty nine scientific hits (93%) and five scientific misses (7%). 
This kind of exception to an original classification may be more 
common than these data indicate. It involves a branching operation
(Tiedeman, O’Hara, and Matthews, 1958). As careers become
more specified, branching becomes the rule rather than the excep- 
tion. Under these conditions, the variables used in prediction and 
counseling will probably also vary. 


The Process of Translation 


One might conceive of the translation process as the recognition 
that a given trait is related to the world of work (Super, 1963). In 
the early years, the trait may be possessed, used, or enjoyed in it- 
self. The realization that this trait is needed for a given career may 
come suddenly, through some striking experience, or it may be a 
slow unfolding, based on a cumulating series of career related successes,
perhaps failures.

The writer would hesitate to hold that self concepts, which are 
de facto related to the world of work, are always seen initially in 
an unrelated way. Our culture is work and career oriented. Our 
children are exposed to the work environment through all the 
communications media at the age of two or before. “Hi diddle, 
diddle, the cat and the fiddle . . .” is now accompanied or displaced 
by “Make way for the Thruway.” Thus it is at least probable that 
the translation process, in a rudimentary form, begins very early, 
and perhaps concomitantly with some initial self perceptions. 

O'Hara and Tiedeman (1959) studied the clarification of self
concepts in areas presumed to be related to career development. 
The presumption was based on previous research studies in the 
field. In the 1959 study, clarification is clearly shown by the in- 
creasing relationship between self concepts and test scores. Trans- 
lation into occupational terms is not necessarily present. However, 
when the self concept not only clarifies but is increasingly related 
to Science and Non Science choices, then we would seem to be 
justified in saying that these are vocational self concepts and that
there is evidence here that will help in understanding the translation
process.


The Criterion Problem in Research and Counseling


This research, the two previous studies (O'Hara, 1962a, 1962b), 
together with those of Cooley (1963), and Herriot (1961), have 
convinced the writer that, for both guidance and research purposes, 
the short term consideration of a broad objective is more appro- 
priate for the adolescent level. 

Assuming that adolescence is a time of clarification rather than 
specification, both counselor and client should be interested in de- 
limiting areas of positive and negative choice. Table 1 indicates 
what appear to be important variables in delimiting a Science or 
Non Science choice. More specific criteria are valid during ado- 
lescence for more specific aspects of vocational development, e.g., 
the vocational developmental task of scholastic achievement. 

The difference between these two aspects of the career develop- 
ment problem is seen in the differing set of predictors resulting 
from the use of the different criteria. The most striking difference 
is the relative weight given to the aptitude and self rating of ap- 
titude. When the self ratings of aptitude are used to predict high 
school achievement they are of the utmost importance (O’Hara, 
1962b). When these same aptitude self ratings are related to the 
choice of a Science or Non Science career, their importance is 
greatly diminished. 

Recognition of varying criteria for varying levels of vocational 
development is good for guidance purposes because it allows for 
growth and change. It shifts the focus from early specification to 
clarification and development. The pressure for early specification 
comes from several sources. Upward mobility is a generalized 
value in our society. One way to do better is to start earlier. Thus, 
early specification of career becomes a kind of criterion for the 
success of the guidance department. What evidence we can inter- 
pret to date seems to indicate that early career choice is the excep- 
tion rather than the rule. Nor is it clear that the gradual process 
of increasing specification during the adolescent years can be hurried
up in such a way that benefit will accrue to the individual and
to society. 

If clarification and development are more characteristic of the
choice process in the teens, then researchers should accentuate this
in their investigations. It would seem that such an approach would 
bring the research model closer to what appears to be the reality of 
what we are trying to measure.


RESEARCHERS in psychology and education frequently encounter
variables whose distributions show pronounced departures from
normality. Although many statistical techniques (the multivariate
procedures in particular) assume normally-shaped distributions of
variates, it is often a difficult and laborious procedure to trans- 
form non-normal distributions into normally-shaped ones. Investi- 
gators typically apply some of the standard transformations (e.g., 
log, inverse sine, square root) in a more or less trial-and-error 
fashion, but often the anomalies are such that none of these will 
produce a satisfactory result. 


Purpose


This report describes a computer program which will transform
a set of variables into normalized distributions of standard scores.
The program yields a punched output of normalized standard
scores (with provision for ID numbers) and also a listing of the
raw score and transformed score equivalents for each variable.
Any combination of subjects x variables up to 31,500 can be ac- 
commodated. For those investigators who may wish to prepare 
data arrays for use in “obverse” multivariate procedures, the pro- 
gram contains an option for punching the transpose of the matrix 
of normalized standard scores. The program is written in Fortran
IV for use with the CDC 3600 and 3800 computers (Scope Monitor).²


¹ Portions of this program were written by AWA and MS while they were
with the National Merit Scholarship Corporation.


Method 


The method used by the program is the most generally effective 
technique (and perhaps also the most laborious one) for normalizing
any frequency distribution. Each observation is first ranked
(1 to N) and then converted into a percentile score. Percentiles are
computed according to the following formula: P_s = 100 − 100
(R_s + .5)/N, where P_s is the percentile rank for subject S, R_s is
the rank of subject S among all subjects in the distribution
(1 = highest-ranked), and N is the total number of subjects. Since
each score is rounded to the nearest tenth of a percentile point, only
999 different scores (00.1-99.9) are possible in a given distribution.

The program next consults a table of percentile ranks and cor- 
responding standard scores from the normal probability curve (this 
table has been generated in core by the use of DATA statements 
in the program) and substitutes the appropriate standard score 
for each percentile score. The resulting standard scores have a 
mean of 500 and a standard deviation of 100, or, if three-place 
accuracy is not desired, a mean of 50 and a standard deviation of 
10 (i.e., the familiar T score).
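
The rank-to-percentile-to-standard-score sequence described above can be sketched in a few lines. This is an illustration only, not the Fortran program itself: the function names are hypothetical, the inverse normal function of the Python standard library stands in for the program's stored table, and the half-rank correction is taken as −.5 (an assumption made so that the lowest-ranked subject receives a positive percentile; the sign in the printed formula may differ, so the offset is left as a parameter).

```python
from statistics import NormalDist


def percentile_from_rank(rank, n, offset=-0.5):
    """Percentile for a rank (1 = highest), to the nearest tenth.

    offset=-0.5 is an assumed midpoint correction; change it to match
    whatever convention the program actually uses.
    """
    return round(100.0 - 100.0 * (rank + offset) / n, 1)


def normalized_score(percentile, mean=50, sd=10):
    """Standard score on the normal curve for a percentile (T score by default)."""
    # Keep the percentile inside the 00.1-99.9 range mentioned in the text.
    p = min(max(percentile, 0.1), 99.9) / 100.0
    return round(mean + sd * NormalDist().inv_cdf(p))


n_subjects = 5
t_scores = [normalized_score(percentile_from_rank(r, n_subjects))
            for r in range(1, n_subjects + 1)]
```

Passing mean=500 and sd=100 gives the three-place version of the same scores.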

The normalized standard scores can be punched in integer form 
in any desired output format, with two restrictions: the order of the 
variables must be the same as their order during input, and the ID 
number must always occur first on each punched output card. The
printed output shows each raw score, the rank (1 to N) of the
raw score in the distribution, and the normalized standard score 
assigned to the raw score. A special diagnostic message is also 
printed out whenever more than 100 ties occur on any raw score. 


² For potential users of this program we have obtained the following list of
university-based CDC 3600 or 3800 systems: University of California (Lawrence
Radiation Laboratory), Livermore; University of Michigan, Ann Arbor;
University of Paris, France; Tata Institute, Bombay, India; University of
California, San Diego; University of Wisconsin, Madison; Indiana University,
Bloomington; University of Massachusetts, Amherst; Uppsala University,
Sweden; and Blaise Pascal Institute, Paris, France. Commercial CDC 3600 or
3800 systems can be found at the following places: 8100 34th Avenue South,
Minneapolis, Minnesota; 5630 Arbor Vitae, Los Angeles, Calif.; 11428 Rockville
Pike, Rockville, Md.; 575 Lexington Avenue, New York, N. Y.; 3330
Hillview Avenue, Palo Alto, Calif.; 7015 Gulf Freeway, Houston, Texas; and
545 Technology Boulevard, Boston, Mass.


Ties are handled by setting each tied observation equal to the
mean of the normalized standard scores corresponding to the ranks 
spanned by the ties. For example, if five subjects were tied for 
tenth-ranked in a distribution, the mean of the five standard 
scores that would normally be assigned to ranks 10, 11, 12, 13, and 
14 would be computed and assigned to each of these five subjects. 
The next-highest score would then be considered as ranking 15th. 
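
The treatment of ties can be sketched as follows. The function name is hypothetical, the scores are left as unrounded normal deviates for clarity, and the percentile convention (a half-rank correction of −.5) is an assumption rather than a detail confirmed by the program listing.

```python
from statistics import NormalDist, mean


def normal_scores_with_ties(raw):
    """Normal deviates by rank (1 = highest); tied scores share the mean
    of the deviates for the ranks they span."""
    n = len(raw)
    order = sorted(range(n), key=lambda i: raw[i], reverse=True)
    # Deviate that would be assigned to each rank if there were no ties.
    z_for_rank = [NormalDist().inv_cdf(1.0 - (r - 0.5) / n)
                  for r in range(1, n + 1)]
    result = [0.0] * n
    start = 0
    while start < n:
        end = start
        # Extend the block while the next-ranked raw score is identical.
        while end + 1 < n and raw[order[end + 1]] == raw[order[start]]:
            end += 1
        shared = mean(z_for_rank[start:end + 1])
        for k in range(start, end + 1):
            result[order[k]] = shared
        start = end + 1
    return result


# Hypothetical raw scores: the two 7s are tied for ranks 2 and 3.
z = normal_scores_with_ties([10, 7, 7, 3])
```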


Input 


The program requires that the input data be in the form of integers
not exceeding I15 in magnitude, although the variables can
be of varying size and in any format. Data are read in one subject
at a time. Each subject must have an ID number of at least two
columns in length, and the number must occur before any of the
variables on the first input data card. Since the ID number is
actually read in by the program as two separate integers, the
number can range up to 30 digits (2I15) in length.


Control Cards:
1. Title card: alphameric characters in columns 1-72 (required).
2. Parameter card (all integers should be right-adjusted):

Column   Content
1-4      Number of variables.
5-8      Number of subjects.
9        Number of variable format cards for input (limit of 4).
10       Number of variable format cards for output (limit of 2).
11       “3” indicates that 3-place accuracy is desired in the punched
         output; otherwise two columns will be set aside for each
         normalized variable.
12       Number of cards of punched output (1, 2, or 3) (should agree
         with column 10, above).
13-14    Number of variables desired in first punched output card.
15-16    Number of variables desired in second punched output card.
17-18    Number of variables desired in third punched output card.
19       Any positive integer if the transposed matrix is also desired in
         punched output;³ if not desired, leave blank or zero.
20-21    Number of variable name label cards (limit of 15). Names are
         read consecutively from cards as 10A8.
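
As a reading aid, the column layout of the parameter card can be expressed as a small fixed-width parser. This is a sketch only: the field names are hypothetical labels for the columns listed above, and the treatment of a blank field as zero is an assumption based on the instruction to "leave blank or zero."

```python
# (first column, last column), 1-based and inclusive, as listed above.
PARAMETER_FIELDS = {
    "n_variables": (1, 4),
    "n_subjects": (5, 8),
    "n_input_format_cards": (9, 9),
    "n_output_format_cards": (10, 10),
    "accuracy_flag": (11, 11),
    "n_output_cards": (12, 12),
    "vars_on_card_1": (13, 14),
    "vars_on_card_2": (15, 16),
    "vars_on_card_3": (17, 18),
    "transpose_flag": (19, 19),
    "n_label_cards": (20, 21),
}


def parse_parameter_card(card):
    """Split an 80-column card image into right-adjusted integer fields."""
    card = card.ljust(80)
    values = {}
    for name, (first, last) in PARAMETER_FIELDS.items():
        text = card[first - 1:last].strip()
        values[name] = int(text) if text else 0  # blank treated as zero
    return values


# Hypothetical card: 36 variables, 152 subjects, 3-place accuracy,
# two output cards of 18 variables each, no transpose, 4 label cards.
card = "%4d%4d%1d%1d%1d%1d%2d%2d%2d%1d%2d" % (36, 152, 1, 2, 3, 2, 18, 18, 0, 0, 4)
params = parse_parameter_card(card)

# The text limits subjects x variables to 31,500.
within_capacity = params["n_variables"] * params["n_subjects"] <= 31500
```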


³ Format for each card of the punched transposed matrix is: I4 (“subject”
ID number), I4 (deck number), 24I3 (data). The “subjects” are, in this case,
the original variables, and their ID's are the sequential numbers of the
variables in the initial input data.


The format of the input data and the desired format for the
punched output should be indicated in conventional variable for- 
mat style. All variables should be integers (I), and the statement 
should be enclosed in parentheses. It is important to keep in mind 
the following considerations: 

1. The first two variables read from the first input data card 
and the first two variables punched on each output card will be the 
two ID numbers. These must be accounted for in both variable 
format statements. (Note: the fields in columns 1-4 and col- 
umns 13-18 of the parameter card (above) should not include the 
ID numbers.) In order to avoid the possibility of imbedded blank 
columns (instead of zeros) within an ID number field in the 
punched output, it is recommended that the “second” ID number, 
wherever possible, be read in and punched out as I1. Thus, a
four-digit ID number would be read in and punched out as I3, I1.

2. If more than one punched output card per subject is used, 
the program will also punch out a “deck number" (1, 2, or 3) as 
the last variable on each card. This number must be allowed for 
(I1) on the variable format output statement.


The final order of cards should be:

Binary deck (with appropriate systems cards)
Title card
Parameter card
Variable format card(s) for input (cols. 1-72 only)
Variable format card(s) for punched output (cols. 1-72 only)
Label card(s) (if any)
Data
Title card for second job
etc.


A binary deck is currently available upon request to the Office 
of Research, American Council on Education, 1785 Massachusetts
Avenue, N.W., Washington, D. C. 20036. 

In some multivariate studies we have observed that results obtained
with the normalized variables may be distinctly superior to
results obtained with the non-normalized data. Factor structures 
are sometimes clearer, and multiple regression coefficients occasionally
higher. In one study, for example, a multiple R of .80
(N = 1,000) rose to .94 when normalized variables were used instead
of the non-normalized variables. (In both analyses the criterion
variable was unaltered.) We should like to hear from other
investigators who may also have had an opportunity to compare 
the results of multivariate analyses performed on normalized and 
non-normalized data.


SCORING, ANALYZING, AND REPORTING CLASSROOM
TESTS USING AN OPTICAL READER AND 1401
COMPUTER


C. DEAN MILLER, MARILYN DOIG, AND GEORGE MILLIKEN
Colorado State University 


Recent technical advancements including the design of optical
readers have greatly facilitated the use of computers in scoring and 
analyzing classroom tests. The test-scoring service and Statistical 
Laboratory at Colorado State University have developed a series 
of experimental programs to test the feasibility of using an IBM 
1231 optical reader and IBM 1401 computer in scoring tests. The 
programs were designed to provide: (a) accurate and efficient 
scoring of tests, (b) item analysis to improve test construction, 
(c) individual reports to students and faculty, (d) references to 
areas to be reviewed, and (e) maintenance of grade registers for 
large classes and computation of final grades using computers. 

Instructors at Colorado State University may use stock IBM 
1230 or 1231 answer sheets which are fed directly into the optical
reader and computer at the rate of 600 to 2000 an hour, depending
on the type of analysis and format used to report results. The
programs, which have several options, accommodate a wide range
of tests. An answer sheet containing 150 items with five response
positions may be fed directly into the system which will complete 
the scoring, analysis, and printing of an individual report or the 
punching of a card for each student. Special marking pencils are 
not required. Students record answers with a regular pencil. 

A uniform set of statistics is computed and printed for each set 
of test papers, including number of papers graded, mean, standard
deviation, and distribution and frequency of scores. An analysis 
consisting of frequency and percentage of students choosing each 
alternative and a point biserial correlation coefficient is computed 
and printed for each test item. In one option, the program computes
the correlations between test items and total test scores, rejects
all items below a specified correlation coefficient, and recomputes
each student's score based on the items retained.
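
The reject-and-rescore option described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the 1401 program: the function names, the 0/1 response matrix, and the use of an ordinary Pearson correlation between each item and the total score are all hypothetical choices.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / sqrt(sxx * syy)

def rescore(matrix, min_r):
    """matrix[s][i] is 1 if student s answered item i correctly, else 0.
    Returns (indices of retained items, new score for each student)."""
    totals = [sum(row) for row in matrix]
    n_items = len(matrix[0])
    # Keep only the items whose correlation with the total score reaches min_r.
    retained = [i for i in range(n_items)
                if pearson([row[i] for row in matrix], totals) >= min_r]
    new_scores = [sum(row[i] for i in retained) for row in matrix]
    return retained, new_scores
```

Called with a four-student, three-item matrix and a cutoff of 0.5, the sketch drops the weakest item and rescores every student on the two that remain.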

It is possible to mark one or more response positions on the 
key for each item and score as right or wrong items containing 
one or more marks. For example, on a botany test a student may 
be required to mark positions 1, 4, and 5 on an item in order to get
credit for the question. In scoring each item, any combination of
marks or blanks other than the ones indicated by the key will 
cause the computer to score the response(s) as being wrong. Mul- 
tiple marks are scored as wrong on all items keyed with single 
responses. It is not possible to give partial credit for an item re- 
quiring multiple marks for a correct response. The greatest limi- 
tation is that an item may have only one response marked for a 
correct answer in order to complete the item analysis. 
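
The all-or-nothing scoring rule for keyed items, combined with a rights-minus-wrongs formula, might look like the following Python sketch. The function name, the representation of each item as a set of marked positions, and the decision to count an unmarked item as blank (as the BLANK column of the printed report suggests) are assumptions made for illustration.

```python
def score_sheet(key, sheet, right_weight=1.00, wrong_weight=0.25):
    """key and sheet are lists of sets of marked positions, one set per item.
    An empty set on the sheet is counted as blank (an assumption)."""
    right = wrong = blank = 0
    for keyed, marked in zip(key, sheet):
        if not marked:
            blank += 1
        elif marked == keyed:
            right += 1   # every keyed position marked and nothing else
        else:
            wrong += 1   # any other combination; no partial credit
    score = right_weight * right - wrong_weight * wrong
    return right, wrong, blank, score
```

With the default weights, a sheet with 38 right, 20 wrong, and 7 blank on a 65-item test scores 33.0.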


Output 


Each of the three programs prepares an identical printed re- 
port for the instructor. A sample of the printed report which is 
common to all three programs follows: 


MILLER, C.D. EDUC. PSYCH. 2/17/66 6828

1.00 RIGHT MINUS 0.25 WRONG (a)
THIS TEST CONSISTS OF 065 ITEMS

[key line, label illegible] (b)  CORRECT 065 WRONG 000 BLANK 000 SCORE 65.00
ST. NO. 000168 (c) ____ CORRECT 038 WRONG 020 BLANK 007 SCORE 33.00
ST. NO. 000981 ____ CORRECT 043 WRONG 021 BLANK 001 SCORE 37.7
ST. NO. 001482 ____ CORRECT 046 WRONG 019 BLANK 000 SCORE 41.5

THERE WERE 124 PAPERS GRADED
M = 44.50 STANDARD DEVIATION = 7.32
GRADE DISTRIBUTION
RAW SCORE FREQUENCY
59 2
58 1
25 1
23 1

PARAM 06510025 MILLER, C.D. EDUC. PSYCH. 2/17/66 6828 (d)
QUESTION    FREQUENCY (responses 0 to 5)    PERCENT (responses 0 to 5)
[item rows illegible in the source; each row lists, for one question, the
frequency and percentage of students choosing each response]

(a) Scoring formula used: 1.00 x no. right minus 0.25 x no. wrong.
(b) This is the score for the second key, which is used to check the accuracy of the first key fed into the computer.
(c) Four blanks may be used by the instructor for additional ID information.
(d) The parameter card used to set the computer and to identify the output is listed.
(NOTE: * MARKS THE KEYED ANSWER, AND "0" MEANS LEFT BLANK)

One experimental program scores the answer sheet and uses a
subroutine to look up the student's name on a tape record from 
which the computer punches an IBM card and prints a record as 
follows: 

WHITMER GARY LEE 033000 RIGHT 38 WRONG 14 BLANK SCORE 72 LAB 08 


TAYLOR TOBERT AUGUSTUS 001888 RIGHT 43 WRONG 7 BLANK SCORE 86 LAB 08 
HUNSAKER LYLE WAYNE 008931 RIGHT 29 WRONG 19 BLANK SCORE 58 LAB 07 


The IBM cards are interpreted and sorted by lab number and 
returned to the students through their regular lab section. The 
instructor retains the printed output for his record. Two cards 
may be punched with one card being retained for computing final 
grades. 

Another experimental program was developed to provide infor- 
mation which enables the student to review the areas in which he 
missed one or more questions. Correlations between test items and 
total test scores are computed and items below a specified cor- 
relation coefficient are rejected. Using the items which are re- 
tained, a new score is computed and the following output is printed 
for each student: 


JANICE BARNETT 000981 PAGE 1 


SCORE = 48.0 STANDARD SCORE = +1.18 
THERE WERE 53 QUESTIONS GRADED. 

I SUGGEST THAT YOU STUDY THE FOLLOWING MATERIAL: 
REVIEW THE MATERIALS ON PROPAGANDA (CHAPTER 16) 
REVIEW THE MATERIALS ON SOCIAL MOBILITY (CHAPTER 14) 


In addition to the student’s individual report, the instructor 
receives the following printed output based on the items retained: 


NAME NUMBER SCORE STANDARD SCORE 
GAY MALONEY 022622 74.1 .22 
ROBERT ANDERS 023953 52.8 —1.33 
BOB PEPLER 027552 88.5 .90 
107 MISSED BY 107 STUDENTS 7 QUESTIONS


REVIEW THE MATERIALS ON THE DETERMINANTS OF SOCIAL CLASSES 
AND THE FACTORS CONCERNING THEIR COMPOSITION (CHAPTER 11). 
012 MISSED BY 012 STUDENTS 1 QUESTION 
REVIEW THE MATERIALS ON THE LEADERSHIP OF CROWDS (CHAPTER 16). 
The numbers 107 and 012 inform the instructor that this many 
of the 135 students taking the examination missed one or more of 
the questions in each of the respective areas. The number of ques- 
tions included in each area is printed on the same line as the 
number missing one or more questions. 
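
The area summary printed for the instructor can be reproduced with a small tally. The Python sketch below is hypothetical: the function name, the mapping from review areas to item numbers, and the per-student sets of missed items are assumed data structures, not part of the program described.

```python
def area_summary(areas, missed_by_student):
    """areas maps a review-area name to the set of item numbers it covers.
    missed_by_student is a list of sets of item numbers missed, one per student.
    Returns {area: (students missing one or more items, items in the area)}."""
    summary = {}
    for name, items in areas.items():
        # A student counts once per area, however many of its items were missed.
        n_missing = sum(1 for missed in missed_by_student if missed & items)
        summary[name] = (n_missing, len(items))
    return summary
```

Each entry pairs the count of students who missed at least one question in the area with the number of questions in that area, matching the two figures printed on each summary line.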
Administering the Program


Proper marking of answer sheets is the most crucial factor in the 
entire operation. Close supervision by the instructor during testing 
is very important. Electrographic pencils and pens cannot be used
successfully. Ordinary lead pencils with No. 2 lead are recom- 
mended and have proven to be highly satisfactory. Processing be- 
gins with the reading of the answer sheet by the 1231 optical reader, 
and with two exceptions is virtually completed by the computer 
without operator intervention. These exceptions are the following: 
(a) stray marks in the timing mark area will cause the sheet to be 
selected and not processed, and (b) punched cards must be in- 
terpreted and sorted into lab sections. 

It is necessary to prepare only one parameter card for each set
of answer sheets to be scored. This card contains: (a) the number
of items to be scored, (b) the scoring formula, in which the weight
applied to rights or to wrongs has a maximum value of 9.00, and (c)
identifying information which is printed on the output.

A cardboard stencil is used to prepare a key for the instructor. 
This makes it possible to score by hand poorly marked sheets or 
sheets which would not feed because of stray marks. It also en- 
ables the instructor to spot check the overall accuracy of the scor- 
ing. Less than one-half of one per cent of the sheets fail to feed 
as a result of poor marking or stray marks. Poor erasures and extraneous
marks result in lower scores for students. This in turn
puts the responsibility for making good marks directly on the 
students. Processing is not disrupted by poor marks or erasures in 
the response area. 

The overall accuracy of scoring has increased over that of the
IBM 805 electrographic system. The ease of distributing supplies
and the efficiency in handling and scoring sets of test papers, along
with additional information such as item analysis and frequency
distributions, have resulted in rapid growth and expansion of the
regular test-scoring service. Our experience has indicated that input
from 1231 documents directly into a computer is more efficient than,
and superior to, mark-sense card input. Students' responses to the
experimental programs appear to be favorable, and plans are being
made to make the three experimental programs available to
instructors teaching large classes.
Test Scoring as a Service to Faculty


The present test-scoring system at Colorado State University 
has evolved from an IBM 805 system in operation from 1950-63 
through an experimental program based on the system reported 
by Stafford and Bianchini (1963) to the present system which 
utilizes an IBM 1231 Optical Reader and IBM 1401 Computer 
with 12K storage, two tape drives, and two 1311 disk drives. The
major objectives in the experimental programs were: (a) accurate 
and efficient handling and scoring of tests, (b) feedback to students 
and instructors by the next class period, (c) item analysis to im- 
prove the construction of test items, (d) an individual report for 
each student indicating areas to review, and (e) maintenance of 
grade registers for large classes and computation of final grades 
using computers. The last objective is the only one which has not 
been tested with satisfactory results. 

Success in the experimental phases was directly related to the 
encouragement and support of interested faculty members along 
with the assistance of very competent staff and programmers in 
the Statistical Laboratory. The critical aspect of marking has 
been the major source of problems. A few instructors have failed 
to provide for adequate supervision of marking. Approximately 
60,000 answer sheets were scored fall quarter, 1965, with indica- 
tions that the number of users is continuing to increase. 


Summary 


A series of experimental programs was developed to test the 
feasibility of using an optical reader and computer to score, ana- 
lyze, and report the results of classroom tests. Students record 
answers with a regular pencil on a stock IBM 1230 or 1231 an- 
swer sheet which contains 150 items with five response positions 
for each item. Single and multiple marks may be scored, but the 
item analysis is limited to single marks only. 

The various programs will score the tests and report the scores 
in one of several formats, including individual reports for students 
and listings for the instructor. Test analyses include computation 
of mean, standard deviation, number of papers graded, distribu- 
tions of three types of scores, point biserial correlation coeffi- 
cients between test items and total scores, and item analysis which 
provides both frequency and percentages for each response position
for each item on the test.

Further development and expansion of the experimental pro- 
grams in addition to the established scoring services are antici- 
pated. Accuracy of scoring and efficiency in handling test materi- 
als have been noted by faculty members who have used the service. 
This system represents another improvement in the use of com- 
puters for scoring and analyzing tests and reporting the results to 
assist instructors in improving test construction and to identify 
for each student areas to be reviewed. 
AN OBJECTIVE TEST ANALYSIS
PROGRAM UTILIZING THE IBM 1230 OPTICAL 
MARK SCORING READER 


HOWARD H. ENGLISH AND CATHLEEN M. KUBINIEC
State University of New York at Buffalo 


The Objective Test Analysis (OTA) Program described in this
article is designed to aid in the evaluation of multiple-choice
examinations and the assignment of grades. An IBM 1230 Optical
Mark Scoring Reader with an on-line 534 card punch punches, for
each test paper, one or more cards containing an identification
number, score(s), and responses to each item as the papers are
being processed by the 1230. Hence, answer sheets can be
scored upon receipt and returned to the instructor immediately.

The OTA Program is composed of four major sections, providing
measures of central tendency and dispersion, a frequency
distribution of scores, an item analysis, and a class list of raw and
standard scores. The program is described below.


Description of the OTA Output 


Part A: General Description 


For each score, the following information is included:

1. Labeling information: request number, course number, instructor's
name, department, date.
2. Number of papers scored.
3. Mean
4. Median
5. Mode
6. Variance and standard deviation.


Part B: Score Distribution


A frequency distribution is presented, with scores obtained ranked 
from lowest to highest. For each score, the following is listed: 
1. Number receiving that score (N)
2. Number receiving that score and all 
scores below that score (CUM.N) 
3. Proportion receiving that score and 
all scores below that score (CUM.PER) 


Part C: Item Analysis 


For each item, the following information is included: 
1. Number of students who selected each option
and number of students who omitted the item.
2. Correct option (indicated by the symbol *
following the value indicating the number
of students who selected that option.)
3. Number who attempted the item and correctly answered it.
4. Number who attempted the item and incorrectly answered
it.
5. Number in "high group" ($R_H$) minus number in "low group"
($R_L$) who correctly answered the item (explained below).
6. Difficulty index 
7. Discrimination index 
8. Point Biserial r 
The difficulty index, $100\,(1 - N_R/N)$, provides a measure of the
difficulty of the item for the group taking the test, where $N_R$ is the
number of students who answered the item correctly and $N$ is the
total number of students who attempted the item (omits are excluded).
It varies from 0.00 to 100.00; as the number of wrong responses to
an item increases, the difficulty index increases.
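
As a small worked illustration of the difficulty index, the Python sketch below computes it from the counts of right and wrong answers to one item. The function name and arguments are hypothetical; the only assumption beyond the text is that the number attempting the item equals rights plus wrongs.

```python
def difficulty_index(n_right, n_wrong):
    """100 * (1 - NR/N), where N counts only students who attempted the item."""
    n_attempted = n_right + n_wrong   # omits are excluded
    return 100.0 * (1.0 - n_right / n_attempted)
```

An item answered correctly by 30 of the 40 students who attempted it has a difficulty index of 25.0; an item nobody missed has an index of 0.0.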


The discrimination index,

$$\frac{R_H - R_L}{0.1\,N},$$

is obtained from index 5 above. The class is ranked relative to total
score. The ranked scores are divided in half, producing the "high
group" and the "low group". The index varies from -5.00 to +5.00.
Positive values indicate positive discrimination; i.e., more students
with high total test scores (high group) answered the item correctly
than did students with low total test scores (low group). Conversely,
negative values indicate negative discrimination. A scale value of
1.00 indicates a 10 per cent difference between the number in the
high group and the low group who answered the item correctly;
a scale value of 2.00 indicates a 20 per cent difference; etc.
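
A hedged Python sketch of the discrimination index follows. It assumes the index is the difference between the high-group and low-group counts divided by one tenth of the class size, which is the scaling implied by the stated range of -5.00 to +5.00; the function name, the handling of ties, and the dropping of the middle student when the class size is odd are illustrative choices only.

```python
def discrimination_index(scores, item_correct):
    """scores: total test scores; item_correct: parallel list of booleans
    telling whether each student answered this item correctly."""
    n = len(scores)
    # Rank the class by total score, highest first.
    order = sorted(range(n), key=lambda s: scores[s], reverse=True)
    half = n // 2
    high, low = order[:half], order[n - half:]
    r_high = sum(item_correct[s] for s in high)
    r_low = sum(item_correct[s] for s in low)
    return (r_high - r_low) / (0.1 * n)
```

When every student in the top half answers correctly and nobody in the bottom half does, the sketch returns the maximum value of 5.0.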


The point biserial r index,

$$r_{pb} = \frac{M_R - M_W}{S}\,\sqrt{pq},$$

provides an index of success on a given item compared to success
on the total test. ($M_R$ = the average total test score of students
who answered the item correctly; $M_W$ = the average total test
score of students who answered the item incorrectly; $p$ = proportion
answering correctly; $q$ = proportion answering incorrectly; and $S$ =
standard deviation, total group.) The index ranges from -1.00 to
+1.00. A positive correlation for a given item indicates that those
who got that item right tend to get a high total test score. A negative
correlation has the reverse relationship. The point at which a
point biserial r is significant is a function of the number of students
tested.
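
The point biserial index can be computed directly from the quantities defined above. The Python sketch below is an illustration, not the OTA code: it assumes the usual form (MR - MW)/S times the square root of pq, uses the population standard deviation for S, and assumes every student attempted the item so that p + q = 1.

```python
from math import sqrt
from statistics import mean, pstdev

def point_biserial(scores, item_correct):
    """scores: total test scores; item_correct: parallel list of booleans."""
    right = [s for s, c in zip(scores, item_correct) if c]
    wrong = [s for s, c in zip(scores, item_correct) if not c]
    p = len(right) / len(scores)   # proportion answering correctly
    q = 1.0 - p                    # proportion answering incorrectly
    return (mean(right) - mean(wrong)) / pstdev(scores) * sqrt(p * q)
```

With the population standard deviation, this value coincides with the ordinary Pearson correlation between the 0/1 item score and the total score.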

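The following indices relating to the test are provided:

1. Kuder Richardson 20 Reliability Estimate

$$r_{20} = \frac{N}{N-1}\left[\frac{S^2 - \sum pq}{S^2}\right]$$

(in this formula $N$ is the number of items in the test)

2. Standard Error of Measurement

$$S\sqrt{1 - r_{20}}$$
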
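
The two test-level indices can be sketched in Python as follows. This is a minimal illustration assuming the standard Kuder-Richardson formula 20 and the population variance of total scores; the function names and the 0/1 response matrix are hypothetical.

```python
from math import sqrt
from statistics import pvariance

def kr20(matrix):
    """matrix[s][i] is 1 if student s answered item i correctly, else 0."""
    n_items = len(matrix[0])
    n_students = len(matrix)
    totals = [sum(row) for row in matrix]
    s2 = pvariance(totals)                     # variance of total scores
    sum_pq = 0.0
    for i in range(n_items):
        p = sum(row[i] for row in matrix) / n_students
        sum_pq += p * (1.0 - p)                # item variance p*q
    return n_items / (n_items - 1) * (s2 - sum_pq) / s2

def standard_error(matrix):
    """Standard error of measurement: S * sqrt(1 - reliability)."""
    totals = [sum(row) for row in matrix]
    return sqrt(pvariance(totals)) * sqrt(1.0 - kr20(matrix))
```

For a toy matrix of four students and three items with total-score variance 1.25 and item variances summing to 0.75, the reliability works out to 0.6 and the standard error to about 0.71.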
Part D: Class List 


This output includes a listing of each student's identification 
number, name, total raw score, and z and/or T score(s).


Procedure 
Data Obtained from Instructor 


The instructor indicates information required and options desired
by completing a Request for Scoring Services Form. This
form includes identifying information, scoring formula desired
(R, W, or corrected for chance), number of part scores, items
comprising each part, number of options in items comprising each
part, services desired (scoring; OTA-all, statistical analysis
only, or item analysis only; z and/or T scores), and additional
data used for administrative purposes.


Control Cards


Name of Card        Contents

Information Card    Department name
                    Request number (for record keeping)
                    Examination date
                    Date submitted for scoring
                    Number of last item used
                    Number of scores
                    Notation of which score is the total score
                    If item analysis is desired, Y is punched; if not, blank
                    If statistical analysis (Parts A and B) is desired,
                      Y is punched; if not, blank
                    Z is punched if class list with z scores is desired
                    T is punched if class list with T scores is desired
                    B is punched if class list with both z and T
                      scores is desired
                    P is punched if class list with both z and T
                      scores is desired and if punched cards are desired
                    Instructor's name; other comments
                    Format of class list desired:

                      No. Lists   Spacing   Names?   Punch
                      1           double    yes      0
                      1           double    no       1
                      1           single    yes      2
                      1           single    no       3
                      2           double    yes      4
                      2           double    no       5
                      2           single    yes      6
                      2           single    no       7

                    Blank

Format Card         FORMAT
                    Standard Variable Format

Key Card 1          MASTER
                    Blank

Key Card 2+         Key used in scoring, punched according to format

Data Card 1         DATA
                    Blank

Data Card 2+        Cards punched during scoring, according to
                    previous format

New Job             Columns 1-6: NEWJOB (indicates end of data for a
                    particular course)

End of Job          Columns 1-3: EOJ (follows last new job card)

(Column designations for the remaining fields are only partly legible in
the source: 1-4, 5-6, 12-23, 24-27, and 47-48 survive but cannot be
matched to individual fields.)

The Mainline reads and interprets the control cards and prepares
the data for processing by the report subroutines. Subroutine
1 generates the General Description and the Score Distribution
output. A maximum of 20 part scores, with a maximum value
of 200 for any given score, is possible. Subroutine 2 generates the
Item Analysis. Subroutine 3 generates the Class List. A maximum
of 2000 students is allowed.


Systems 


The program was written under 7044 IBSYS Fortran IV, Version
9, Mod. Level 3. It uses all available utility drives on the
IBM 7044 Computer. S.SU00 is the hold tape for the raw data.
It is merged with S.SU01, containing names and student numbers
of all registered students. (This name tape is 42 characters in
length, blocked as high as storage will permit.) S.SU02 is a scratch
for the main line and sort. S.SU03 is the decoded key; S.SU04
is a scratch for this sort.

The existing system occupies all but 150 decimal positions of
core.

Machine time is reduced extensively by pre-sorting data cards 
in raw score order. 


Summary 


Advantages 


Use of the 1230 Optical Scanner allows data cards to be punched 
simultaneously with the scoring of answer sheets. If scores are 
needed immediately, answer sheets can be returned to the instruc- 
tor. In most cases, the OTA Printout is available within a 12 hour 
period. 

The correct answer is indicated on the Printout, providing a 
direct check on the accuracy of keying. 

Cards can be prepared and retained over the semester, from 
which a distribution of scores based on the sum of individual test 
scores can be obtained.

The program provides internal checks of the accuracy of con- 
trol cards and student numbers. 

Several options are available to the instructor. The scoring
formula may be Rights, Wrongs, or Corrected for Chance. A total
score and a maximum of 20 part scores may be obtained. Services
which can be requested include scoring only, scoring and the complete
OTA, or scoring and any combination of the OTA subportions.
The Class List may contain z scores only, T scores only,
or both z and T scores; it may contain student numbers only, or
student numbers and names; names may be listed in alphabetic order
or in student number order; the list may be single or double
spaced; one or two copies of it may be obtained; a set of punch
cards may be obtained.


Limitations 
The existing program is limited by the following upper limits:

Dimension                                  Maximum Number
a. students                                2,000
b. scores (part scores and total score)    20
c. items                                   220 (varies below this maximum as a
                                           function of form of answer sheet
                                           used; four forms are available)


Availability 

A printout of the source program and a copy of a manual describing
the program may be obtained by writing the authors, 316
Harriman Library, State University of New York at Buffalo, Buffalo
14, New York. The program is also on file with SHARE.
AN ITEM AND TEST ANALYSIS PROGRAM USED IN
CONJUNCTION WITH THE IBM 1230 OPTICAL MARK 
SCORING READER AND IBM 534 CARD PUNCH 
ATTACHMENT 


JAMES C. MOORE¹ AND RICHARD E. SCHUTZ
Arizona State University 


This paper is directed primarily at scoring operations currently
using, or those contemplating obtaining, the IBM 1230 Optical
Reader for scoring objective examinations. However, the FORTRAN
program described can be adapted to most systems and has
provided a practical means to obtain useful information for student
evaluation as well as statistical data for editorial revision
and modification of test items.

Without the IBM 534 Card Punch Attachment the IBM 1230 
is primarily limited to providing only a reliable and efficient means 
of scoring. To make the system functional for use with computer 
programs, the IBM 534 is required. With the IBM 534 operating 
on line with the IBM 1230, answer sheet data are punched into 
cards at the same time they are being scored. Thus the necessary 
data cards are provided for computer input. When the IBM 534 is 
not operating on line with the Reader, it can be used as a regular 
manually operated key punch. 

¹ Now at the University of New Mexico.

Another characteristic of the system which deserves consideration
is the manner in which the IBM 1230-534 is programmed to
punch answer sheet data into cards. To obtain maximum use of
card space, the system was designed to pack two questions into
one card column. This step results in cards with multiple punched
columns. The data have to be unpacked before they can be acted
upon in the operation. The program described below incorporates a
subroutine called UNPACK which reads the data cards punched 
by the IBM 534, decodes the multiple punches into their original 
answer sheet format, and writes the data onto tape. The program 
then reads the tape to obtain the examinee's original choice to each 
question. 
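
The idea behind UNPACK can be illustrated with a short Python sketch. The actual punch layout used by the IBM 534 is not given in this paper, so the layout below is purely hypothetical: rows 0-4 of a column are assumed to carry choices 1-5 of the first question, and rows 5-9 choices 1-5 of the second. The function names are likewise invented for illustration.

```python
def unpack_column(punches):
    """punches: set of punched rows (0-9) in one card column.
    Hypothetical layout: rows 0-4 = choices 1-5 of the first question,
    rows 5-9 = choices 1-5 of the second. Returns (first, second); 0 = blank."""
    first = [row + 1 for row in sorted(punches) if 0 <= row <= 4]
    second = [row - 4 for row in sorted(punches) if 5 <= row <= 9]
    return (first[0] if first else 0, second[0] if second else 0)

def unpack_card(columns):
    """Decode a sequence of packed columns into one response per question."""
    responses = []
    for punches in columns:
        responses.extend(unpack_column(punches))
    return responses
```

Under this assumed layout, 74 packed columns would hold the 148 items that the program accepts.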


Characteristics of the Program 


I. Output from the program includes:
   A. Test statistics
      1. Mean
      2. Standard deviation
      3. Reliability (KR-20)
      4. Standard error of measurement
      5. Frequency distribution

   B. Converted scores
      1. Percentiles
      2. Stanines
      3. A summary graph which includes
         a. percentiles
         b. stanines
         c. range
         d. frequency

   C. Item statistics
      1. Percentage of responses to each choice for each item
      2. Frequency of responses to each choice for each item
      3. Item difficulty index listed
         a. by item
         b. in order of difficulty
      4. Item discrimination index (point biserial r) listed
         a. by item
         b. in order of magnitude

II. Limitations per problem: 
A. Maximum number of items is 148 


B. Maximum number of choices per question is five 


C. No limit on number of examinees 
III. Card preparation and order in job deck

   A. System cards

   B. Program

   C. Identification card
      CC 1-80: Alphanumeric identification information

   D. Problem card
      CC 1-5: Number of items
      6-10: Number of examinees
      15: Punch a five (upper limit for choices per question)

   E. Data cards
      These cards are punched on-line by the IBM 534 while the
      IBM 1230 is scoring the answer sheet. Thus, there is one
      card for each answer sheet scored. The cards are of the
      following format:
      CC 1-3: Three-digit examinee ID code (if coded on answer sheet)
      4-77: Item responses, packed two to a column
      78-80: Total score

   F. End of data card
      CC 1-3: Punch 999

   G. Variable format card
      This card specifies where keyed answer choices are punched
      on a following key card. For example, if a test consisted
      of 20 questions the format card would be punched as
      CC 1-8: (20F1.0)
      signifying that the key card to follow has the keyed
      answers punched in columns 1 through 20, and that they are
      to be read in floating point mode.

   H. Key card
      This card is punched with the correct answer choices to
      the test questions. For example, if the correct answer
      to question 1 was choice 4, card column 1 of the key card
      would be punched with a 4. Thus, the key card is a
      replicate of the answer sheet key.

   Repeat C through H as many times as desired.


IV. Limitations of the program related to the IBM 1230 system
With slight modification the program should run on most
systems. However, even though subroutine UNPACK is in
FORTRAN, it is written for the CDC 3400 computer characteristics.
Thus, the multiple punch unpacking will be different
for other systems. A series of comment statements
printed with the UNPACK listing suggests ways in which the
unpacking might be handled on other systems.

A listing of the program will be supplied upon request.²


² Available from the first author.
AN IBM 1401 PROGRAM FOR COMPUTING TEST SCORE
DISTRIBUTIONS 


LEROY A. OLSON AND ROBERT L. ROYCE
Michigan State University 


This program was designed to use output data from an IBM
1230 optical scanner to determine test score distributions. Instructors
of large classes can obtain a distribution of raw scores as well
as percentile ranks and standard scores. The test score distribution
program was designed for the IBM 1401 computer.


Input 


Input is by means of a punched card deck consisting of a param- 
eter card, a key card, and data cards. The parameter card con- 
tains a four-digit test number in the first four columns, followed 
by the number 888888 in columns five through ten. The eights 
verify that the first card in the deck is the parameter card. 

The parameter card is followed by the key card which also has 
the test number in its first four columns. The key card is identified 
by the number 999999 in columns five through ten. The total 
number of items in the test is punched in the last three columns of 
the key card. 

The data deck follows the key card. Each data card has the test 
number in its first four columns and the student or subject number 
in columns five through ten. The test score or criterion score is 
punched in the last three columns of each data card. Several tests 
may be submitted in one run. A second test deck is identified by the 
change in test number and by its parameter card and key card. 
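
The deck layout just described lends itself to a simple reader. The Python sketch below is hypothetical: it treats each card as an 80-character string and assumes that "the last three columns" means columns 78-80; the function name and the returned dictionary are illustrative only.

```python
def read_deck(cards):
    """cards: list of 80-character strings in deck order.
    Returns {test_number: {"items": int, "scores": {student: score}}}."""
    tests = {}
    for card in cards:
        test_no, ident, last3 = card[0:4], card[4:10], card[77:80]
        if ident == "888888":      # parameter card opens a new test deck
            tests[test_no] = {"items": None, "scores": {}}
        elif ident == "999999":    # key card carries the number of items
            tests[test_no]["items"] = int(last3)
        else:                      # data card: student number and score
            tests[test_no]["scores"][ident] = int(last3)
    return tests
```

The sketch relies on the two sentinel numbers never occurring as real student numbers, an assumption the original program presumably shares.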


Output 


The printed output contains the test number, the total number of
items, and the current month and year. The listing of occurring
raw scores is shown in the left-hand column of the print-out. The
frequency of each score and the cumulative frequency are shown
in the second and third columns from the left.

The fourth column from the left shows the percentile rank for 
each occurring raw score. The percentile rank corresponding to a 
specific raw score is the percentage of students receiving lower 
scores plus half the students receiving that particular score. In 
other words, the percentile rank is the percentage of students re- 
ceiving raw scores lower than the midpoint of a given score level. 
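
The first four columns of the print-out can be reproduced with a short Python sketch. The function name and the ascending order of the rows are assumptions; the percentile rank follows the definition given above, the percentage scoring lower plus half the percentage at the score.

```python
from collections import Counter

def score_distribution(scores):
    """Returns rows of (raw score, frequency, cumulative frequency,
    percentile rank) for each occurring score, lowest score first."""
    n = len(scores)
    counts = Counter(scores)
    rows = []
    below = 0                      # number of students with lower scores
    for score in sorted(counts):
        freq = counts[score]
        rank = 100.0 * (below + 0.5 * freq) / n
        below += freq              # now the cumulative frequency
        rows.append((score, freq, below, rank))
    return rows
```

For the scores 1, 2, 2, and 3, the middle score receives a percentile rank of 50.0: one student below it plus half of the two students at it, out of four.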

The final column on the right shows the standard score corre- 
sponding to each raw score. The standard score is a linear trans- 
formation of the raw score distribution so that the mean is 50 and 
the standard deviation is 10. 

The mean, standard deviation and variance of the raw score dis- 
tribution are shown at the bottom of the print-out. 


Options 

The mean and standard deviation of the standard scores may be
specified by the user. This feature is especially useful when the
results of a test are being weighted and then combined with other
scores. The desired mean is punched in columns 18 through 21 of
the parameter card and the desired standard deviation is punched
in columns 14 through 17. If no mean or standard deviation is
specified on the parameter card, the standard scores will be conventional
T scores with a mean of 50 and a standard deviation of 10.
The mean and standard deviation actually used in computing the
standard scores are listed on the bottom line of the print-out.
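
The linear transformation to standard scores, with the optional user-specified mean and standard deviation, can be sketched in Python as follows. The function and parameter names are hypothetical, and the use of the population standard deviation is an assumption, since the paper does not say which form the program uses.

```python
from statistics import mean, pstdev

def standard_scores(scores, new_mean=50.0, new_sd=10.0):
    """Linear transformation of raw scores to the requested mean and SD.
    The defaults give conventional T scores."""
    m, s = mean(scores), pstdev(scores)
    return [new_mean + new_sd * (x - m) / s for x in scores]
```

Calling the function with only the raw scores gives T scores; supplying, say, a mean of 100 and a standard deviation of 15 rescales the same distribution for weighting and combination with other scores.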


Restrictions 


Scores on the input cards may range from 000 to 999. There
may be no more than 9,999 data cards. Each data deck must contain
a different test number if the decks are to be batched in a
single run. The program was designed for an IBM 1401 with 12K
internal memory and with special features and tape units.


Availability 
A listing of the program will be supplied free of charge. 
USE OF DIGITAL COMPUTER AS A TEACHING AID


E. E. REMMENGA, У. K. HALVORSON AND W. B. OWEN
Colorado State University 


An ideal teaching procedure for statistical methods courses presented
to non-statisticians is to ask each student to produce his
own data for each type of problem. This procedure makes the
problems realistic and of more interest to the student, particularly
if the data are from his own research project. This technique presents
a complex grading problem, as the situations selected by
students will vary in difficulty; each problem must be worked out
by the grader if the grading is to be satisfactory; and students
may find appropriate solved problems from a number of sources
and thus avoid performing any computations. These complexities
can be avoided with close supervision, which is possible only with
small classes.

At the present time, however, class size is increasing out of pro- 
portion to the number of statistics teachers as a result of the 
spreading use of statistics in many fields. The problems mentioned 
above are magnified with large classes and unscrupulous students 
seem to be more numerous. Assigning book problems or something 
similar becomes necessary, and there is no control of any kind on 
who actually performs the homework to be copied by others. In 
addition, the problem sets are dull and provide no motivation. At
some stage, grading becomes sloppy and students have a justifiable
complaint about lack of critical review of their efforts. 

In an attempt to overcome this difficulty, checking homework on
a computer was conceived and attempted in 1963-64 by one of the
authors (W.B.O.) on a trial basis. The results were quite satisfactory
and sufficiently interesting to promote the idea. It was not
known whether or not such an effort had been undertaken elsewhere,
but there is information on programmed learning, computer
training, computers in numerical analysis, and similar work.
The idea seemed reasonable, and with the cooperation of the available
teaching staff and the programming ability of the authors,
such an operation was feasible. A large amount of computer time
for program debugging and operation was provided by the Director
of the Statistical Laboratory, Dr. F. A. Graybill; and after
some initial discussion and general flow charting, the system was
begun. The system described below was planned in the summer of
1964 and successfully put into operation during the 1964-65 academic
year. The description will include ideas which are not as
yet operative but which are being incorporated into the system.
The results, while rewarding from the instructor's point of view,
cannot be adequately evaluated on the basis of student achievement,
although it appears satisfactory.


Procedure 


The major objective of utilizing a computer for grading was to 
eliminate a dull monotonous chore in order that the instructor 
would be able to devote more time to developing lectures and to 
recapture the rapidly disappearing student-teacher rapport. A sec- 
ond but not subordinate objective was to provide realistic, unique 
problems which could challenge and hopefully motivate each stu- 
dent. 

The first step was to provide each student a unique set of data 
which was suitable for the various problem types normally covered 
in an introductory statistics course. An arbitrary identification 
number was assigned each student. Student registration numbers 
were tried but proved too cumbersome, and a problem created by 
late registrants was more easily handled by using arbitrary assigned
numbers.

A table of normal random numbers (mean zero, variance one) 
was keypunched and used as input to the data generating program. 
A large number was added to each random number to eliminate 
negative numbers. This was not a necessary requirement, but it 
avoided early confusion and also served to make the data appear 
real. This number was a constant but could easily have been а 
random number to increase the uniqueness of each data set. Addi- 
tional constants were added to each of 12 samples of five numbers. 
The general pattern was to have four samples with the same mean,
four samples with an increasing but not significant trend, and four 
samples with a significant increasing trend. This gave each student 
60 items of data in 12 samples of five. By taking samples and 
combinations of samples, data for t-tests, 1, 2, and 3 way analyses 
of variance, and polynomial regressions each with significant or 
nonsignificant results were readily available. Two copies of the 
data were printed, one for the student and one for the instructor. 
The entire data set, identified by the student's name and number 
and the course number, was stored on magnetic tape along with 
sums and sums of squares for each sample of five. A retrieval sys- 
tem made it possible to find and relist any student's data set in 
case of loss.
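
The data-generating step just described can be sketched as follows. This is a minimal illustration in Python, not the authors' program: the base constant of 50, the two trend step sizes, and the name make_data_set are assumptions made only for the example, and a seeded pseudorandom generator stands in for the keypunched table of normal random numbers.

```python
import random

def make_data_set(student_id, base=50.0, small_step=0.2, large_step=2.0):
    """Return 12 samples of five values plus sums and sums of squares.

    Assumed layout: samples 1-4 share a mean, samples 5-8 rise by a small
    step, samples 9-12 rise by a large step. All constants are hypothetical.
    """
    rng = random.Random(student_id)       # reproducible per student number
    offsets = [0.0] * 4
    offsets += [small_step * k for k in range(1, 5)]
    offsets += [large_step * k for k in range(1, 5)]
    samples = []
    for offset in offsets:
        # unit normal deviate plus constants, kept to two decimals
        sample = [round(base + offset + rng.gauss(0.0, 1.0), 2) for _ in range(5)]
        samples.append(sample)
    sums = [sum(s) for s in samples]
    sums_sq = [sum(x * x for x in s) for s in samples]
    return samples, sums, sums_sq

samples, sums, sums_sq = make_data_set(student_id=17)
```

Seeding the generator with the student number plays the part of the retrieval system: the same number always regenerates the same 60 items.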

When the data sheets were distributed to a class, a description 
of the system and its purposes was given, along with basic com-
puter concepts. For orientation purposes, the first assignment was
for each student to interpret the data as a meaningful measure-
ment in his own field. The basic four digit data with two decimals
became bushels per acre, pounds of wool, pounds of milk, dollars 
and cents, test scores, and many other common measurements. 

Computational assignments were made to coincide with lecture 
material. An assignment of any problem type was made by refer- 
encing the necessary samples or combinations of samples on which 
the computations were to be performed. For example, an unpaired t-
test assignment could consist of sample one versus sample two,
sample four versus sample 12, sample one, two, three versus sam- 
ple seven and eight, etc. 

The student performed the necessary computations and recorded 
the desired answers on an answer sheet in specified form along 
with samples used and problem type code. It was not necessary for 
all students to work the same problems nor even the same prob- 
lem type at a given time. These answers were then keypunched for 
grading. 

The grading program located the student’s data on the magnetic 
tape and computed its own answers according to the problem type 
code and samples used. A variable tolerance prescribed for each
problem type was used in checking answers (± .01 for means, ±
.10 for variances, etc.) because of the different accuracy obtained by
the computer as opposed to a desk calculator. The program com-
puted a weighted score for each problem set and printed out a list
of student answers, computer answers, total possible score, and 
achieved score or grade. A grade tape was updated giving the
student credit for the problem set once, and an error note was
printed if a student turned the same sheet or same problem set in
a second time. The information on this grade tape was printed
out in full and with subtotals for distribution to the students near 
the end of the term. Allowance was made for late problems. The 
final problem grade was supplied to each instructor for use in de- 
termining course grade. Several sections were handled concurrently 
with no difficulty. 
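
The tolerance check at the heart of the grading program can be illustrated with a short sketch. The two tolerances are those quoted above; the function name, the dictionary of problem types, and the use of the n - 1 divisor for the variance are assumptions, and only two problem types are shown.

```python
from statistics import mean, variance

# Tolerances quoted in the text; other problem types would need their own.
TOLERANCE = {"mean": 0.01, "variance": 0.10}

def grade_answer(problem_type, sample, student_answer):
    """Recompute the statistic and accept answers within the tolerance."""
    if problem_type == "mean":
        correct = mean(sample)
    elif problem_type == "variance":
        correct = variance(sample)        # n - 1 divisor (assumed)
    else:
        raise ValueError("unknown problem type: " + problem_type)
    # small epsilon guards against floating-point noise at the boundary
    ok = abs(student_answer - correct) <= TOLERANCE[problem_type] + 1e-9
    return ok, correct

sample = [51.20, 49.85, 50.40, 48.95, 50.10]
print(grade_answer("mean", sample, 50.10))
print(grade_answer("variance", sample, 0.90))
```

A per-type tolerance lets a desk-calculator answer that was rounded at an intermediate step still earn credit, which is the reason the text gives for the variable tolerance.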


Discussion


Since a number of different type problems may be run at one 
time, it was deemed necessary to have the entire package in one 
program. Because of a lack of storage capacity, further enlarge- 
ment of this system is not possible without going to a retrieval 
system from either magnetic tape or random access storage. 


To show the ease with which this system was used, approximate 
times for a class of 30 people assigned 8-10 t-test problems are: 


1. Explanation of answer sheets            20 minutes (1st assignment)
2. Keypunching                             45 minutes
3. Clerical work readying answers
   for input                               10 minutes
4. Computer time                           3-5 minutes


Some problems in understanding of answer recording occurred 
and considerable time was spent in orienting the students to 
computer input-output format, format statements, scaling, and ele- 
ments of flow-charting and coding. This activity was not wasted 
effort as the students became very much computer oriented, and 
many were found using the computer later for their thesis prob- 
lems. 

Because of lack of time, no chi square problems had been pro- 
grammed, and at this stage in the course the usual book-type prob- 
lems were assigned with the addition of the usual requests for 
interpretation and write-up which provided an opportunity to 
observe more than arithmetic efforts. The understanding of the con- 
cepts by most students was judged to be unusually good and the 
chi square write-ups, graded in the old laborious manner, were
deemed exceptional. Although such outcomes might be attributable
to the students’ having returned to a familiar regime, it is felt that 
they are more likely due to a better grasp of statistics. Whether or 
not this is due to the computer technique is not known. One thought 
is that the students have been so busy trying to beat the machine 
and/or so confused by the detail of preparing acceptable answer 
forms that they have unconsciously learned the statistical methods, 
or perhaps the statistics fade to simplicity compared to the com- 
puter jargon. It will be some years before the retention of the 
statistical techniques by these first guinea-pig students can be 
evaluated, and any attempt to determine comparative retention is 
probably impossible. 

The equipment used in this grading system included an IBM 
1401 computer with 8K storage, two magnetic tapes, optical scan- 
ner, and two random access disks. A future possibility is to use 
the optical scanner to read answer sheets directly and to eliminate 
most of the keypunching. The recording would be no more difficult 
for the students, but a sequencing problem is involved in process- 
ing. For efficient use of the data tape, the student answer input 
should be ordered by student number. This is easy with cards, but 
not with sheets. 


Summary 


Major problems encountered when teaching statistical methods 
to large classes are: (a) providing realistic problems which will be 
understood by students with varied backgrounds, and (b) grading 
volumes of homework problems. A computer program and a set of 
normal random numbers are used to generate data suitable for 
computation of basic statistical constants and tests. Students re- 
cord their answers on special forms from which they are key- 
punched. A second computer program computes the correct results, 
compares them with the students’ answers, and gives each student 
a grade which can be weighted differently for each problem. The
system has been used for one year at CSU and some comments 
on the outcome have been reported. 

The writers are confident that the students learned at least as
much about statistics from the new approach as from
the old book-problem laboratory-session instruction. They also
learned a great deal about computers, computer applications and
capability, and data recording. Each instructor was able to devote
more time and care to lecture preparation and to personal com-
munication with students, and enjoyed the very rewarding escape
from the tedium of grading.


USE OF MARK-SENSE CARDS WITH ELEMENTARY
SCHOOL CHILDREN


THOMAS Q. CULHANE AND QUENTIN C. STODOLA


Northern Michigan University 
Marquette, Michigan 


MARK-SENSE cards have provided a useful means for collecting
data from adults and older children. But can children
at the lower elementary school level follow the necessary mark-
ing instructions with reasonable accuracy? A partial answer to this
question is found in the results of using mark-sense cards with 
pupils in elementary grades at the laboratory school of Northern 
Michigan University. 


Procedures 


Children in grades one through eight of the John D. Pierce 
Laboratory School were asked their opinions of a new schedule for 
the school week. They were given a list of thirteen statements, and 
were requested to indicate the extent to which they agreed or dis-
agreed by marking an IBM mark-sense card.

In grades one, two, and three, the statements were read to the 
children. In grade three, the pupils were also given a written list of
the statements. In other grades, the children responded directly to 
the written statements alone. 

Directions for the mark-sensing were given to pupils in grades 
one and two by the school principal. Directions to the other classes 
were given by a graduate assistant. A summary of the procedures
used in instructing the examinees on use of mark-sense cards fol-
lows:

Prior to beginning responses, the children were shown a mark-
sense card on an overhead projector. (This is the same card used
for examinations in University courses.) After becoming familiar
with the format of the card, the pupils were shown (again via 
overhead projector) an enlarged drawing of the "top" half of the 
card, on which was demonstrated the proper way to mark re- 
sponses. The first item was used as a practice item, with all exami- 
nees directed to mark the answer in the same response position. 

The second and third items were also practice statements, but 
after some discussion of the scaling system, pupils marked in their 
own choice of responses. Then the children were instructed to re- 
spond to the remaining thirteen items on the questionnaire. 


Findings

The data in Table 1 summarize the responses of the pupils to
the thirteen items by indicating the number of enrollees in each
grade, the number of cases in which children erroneously made
two marks for a single item, and the number of marks which were
too light for machine scoring. Per cents are given in parentheses.


TABLE 1
Summary of Double Marks and Light Marks, by Grade

         No. of students   No. of           No. of          No. of students
Grade    responding        double marksᵃ    light marksᵃ    making light marksᵇ
1        20                3 (1.2%)         26 (10.0%)      8 (40.0%)
2        23                2 (0.7%)         19 (6.4%)       6 (26.1%)
3        17                0 (0.0%)         13 (5.9%)       2 (11.8%)
4        28                2 (0.6%)         1 (0.3%)        1 (3.6%)
5        24                1 (0.3%)         36 (11.5%)      9 (37.5%)
6        23                2 (0.7%)         4 (1.4%)        1 (4.3%)
7        24                0 (0.0%)         0 (0.0%)        0 (0.0%)
8        23                0 (0.0%)         0 (0.0%)        0 (0.0%)
Total    182               10               99              27

ᵃ Per cents based on the number of pupils responding times the number of items (13) on the
questionnaire.
ᵇ Per cents based on the number of pupils in each class.
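
The two percentage bases named in the table footnotes can be checked with a few lines of arithmetic. The sketch below uses the grade one row; the function names are introduced here only for illustration.

```python
ITEMS = 13  # items on the questionnaire

def mark_rate(marks, pupils, items=ITEMS):
    """Per cent of all item responses (pupils x items) that were faulty."""
    return 100.0 * marks / (pupils * items)

def pupil_rate(pupils_with_light_marks, pupils):
    """Per cent of pupils who made at least one light mark."""
    return 100.0 * pupils_with_light_marks / pupils

# Grade one: 20 pupils, 3 double marks, 26 light marks, 8 pupils with light marks
grade_one = (round(mark_rate(3, 20), 1),
             round(mark_rate(26, 20), 1),
             round(pupil_rate(8, 20), 1))
print(grade_one)
```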


Discussion

On the basis of these results, it is believed that the use of
mark-sense cards for listing pupils’ responses is practical with chil-
dren as low as grade one. The major mechanical problem en-
countered was that of pupils’ marking so lightly that their re-
sponses were not counted. It is felt that more emphatic directions
to children on the necessity of making heavy marks, plus addi-
tional practice in use of mark-sense cards, would eliminate a
large percentage of light marks.

It may be noted that people in the very lowest grades did not 
have the highest proportion of light marks, but rather the fifth 
grade pupils erred most in this respect. Possibly the directions were 
not given to the fifth grade children following the standardized 
procedure. 

It is anticipated that mark-sense cards will be widely used in 
the John D. Pierce School even at the lowest grade level as an ef- 
fective means of obtaining important data directly from the pupils 
themselves. 


A FORTRAN IV PROGRAM FOR THE ANALYSIS OF
DEMOGRAPHIC, ITEM, AND SCALE DATA¹ ²


JOSEPH L. FLEISS AND ROBERT L. SPITZER³
Biometrics Research and Columbia University 


THIS paper describes a computer program and various subrou-
tines for the analysis of frequency distributions across subclasses
of demographic variables, of responses to dichotomous (yes-no) 
items, and of scale scores obtained as either weighted or un- 
weighted sums of dichotomous items. The authors wrote the pro- 
gram out of a need to analyze large volumes of data collected on 
the Mental Status Schedule (Spitzer, Burdock, and Hardesty, 
1964) and Psychiatric Status Schedule (Spitzer, Endicott, and 
Cohen, 1965). These instruments consist of structured interview 
schedules and of accompanying inventories of 248 and 492 dicho- 
tomous items, respectively. An examination of existing computer 
programs indicated that none was quite suitable for the kinds of
analyses required by investigators employing the instruments, and 
that none took full advantage of all features of the instruments. 
It was further found that few programs gave adequate identifica- 
tion of the groups being compared or of the variables being 
analyzed in their printouts. A feature of all printouts by the 
program described below is the clear labeling of all samples and
variables.

The illustrations of printouts in this paper are from analyses 
of Mental Status Schedule or Psychiatric Status Schedule data. 


1 This work was supported in part by grant MH 08534 from the National
Institute of Mental Health.

2 All printouts are from the IBM 7094 at the Columbia University Com-
puter Center.

3 Program listings, decks, and detailed instructions for punching control
cards are available from the authors (address 722 West 168 St., New York,
N. Y. 10032).

The program is sufficiently flexible, however, to be of use with any
questionnaire, inventory, or self-rating device possessing the fol-
lowing features: 

(1) Each item is dichotomous. 

(2) Scale scores are either weighted or unweighted sums of item
responses. 

(3) Some scales, dubbed role scales, are scored only for subjects 
meeting a specified criterion. For example, the Psychiatric Status 
Schedule contains a series of items concerning the adequacy of 
a subject’s performance as a parent. These items are judged, and 
the corresponding Parent Role scale is scored, only for subjects 
who are parents. 

The following limitations must be maintained:

(A) The total number of items must not exceed 500. 

(B) The total number of scales, both the role scales and those 
scored for all subjects, must not exceed 50. 

(C) The total number of role scales must not exceed 20.

(D) For each role scale there must be an item, dubbed a cri-
terion item, which indicates whether the criterion is satisfied.

(E) The total number of items assigned to the scales which are 
scored for all subjects must not exceed 800. 

(F) The items comprising the scales which are scored for all 
subjects may be given differential weights. The items comprising 
the role scales may not. 

(G) The total number of demographic variables (the criterion 
items are considered as demographic variables for purposes of 
analysis) must not exceed 50. 

(H) The total number of classes across all demographic variables 
must not exceed 280. 

(I) The responses to the non-criterion items must be so inter-
pretable that one of the two responses may be identified as “posi-
tive.”


Punching the Data


The raw data are punched as follows. The first column of each 
punched card always designates the card number, and columns 
2-6 of each card contain a unique number identifying the score 
sheet being processed. Beginning in column 7 of card 1 are punched 
any demographic or identifying data for the subject, excluding the 
criterion items, desired by the investigator. Beginning in column
7 of card 2, and continuing for as many cards as are necessary, are
punched, in successive three column fields, only the numbers of the
items responded to in the positive direction (see (I) above) and of
the criterion items judged true. Column 79 is always left blank,
and column 80 is left blank for all cards save the last, where it is
punched “-” (i.e., an “11” punch). The usual practice of assigning
each item a unique column to be punched either 0 or 1 was aban- 
doned because it often led to an excessive number of cards per sub- 
ject and because the frequent presence of a long string of 0’s led to 
easy tearing of cards. 
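
The card layout for items can be made concrete with a small reader. The sketch below is in Python rather than the program's FORTRAN IV and handles only the item cards (card 2 onward); the function name and the decision to skip blank three-column fields are assumptions.

```python
def parse_item_cards(cards):
    """Collect the punched item numbers from cards 2..n of one score sheet.

    Each card is an 80-character string. Blank three-column fields are
    skipped; treating blanks as 'no item' is an assumption.
    """
    sheet_id = cards[0][1:6]                   # columns 2-6
    items = set()
    for card in cards:
        card = card.ljust(80)
        if card[1:6] != sheet_id:
            raise ValueError("score-sheet number changes within one subject")
        for start in range(6, 78, 3):          # columns 7-78, 24 fields
            field = card[start:start + 3].strip()
            if field:
                items.add(int(field))
    last_ok = cards[-1].ljust(80)[79] == "-"   # column 80 marks the last card
    return sheet_id, items, last_ok

card2 = ("2" + "00042" + "003" + "017" + "124").ljust(80)
card3 = ("3" + "00042" + "248").ljust(79) + "-"
print(parse_item_cards([card2, card3]))
```

Because only positive items are punched, a subject with few positive responses needs few cards, which is the economy the text claims over one column per item.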


Identifying the Types of Items 


A BLOCK DATA subprogram provides the array, ITYPE, 
which determines the type of each item. All criterion items are 
given an ITYPE of 0. All items which are judged only if a certain 
criterion item is true are given an ITYPE which is the number of 
that criterion item. For each remaining item the ITYPE is minus 
the weight it is to be given in scoring. 

Thus, if an instrument consists of ten items, where the first three 
are scored of all subjects and given unit weight, the second three 
are scored of all subjects and given a weight of two, the seventh 
is a criterion item and the remaining three are judged only if the 
seventh is true, then the BLOCK DATA subprogram might be as 
follows: 


      BLOCK DATA
      COMMON /IBLOCK/ ITYPE
      DIMENSION ITYPE(10)
      DATA ITYPE /3*-1, 3*-2, 0, 3*7/
      END
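
The meaning of the ITYPE codes in this ten-item example can be spelled out with a short sketch. It is written in Python for illustration only; the function name and the labels it returns are not part of the program.

```python
# ITYPE for the ten-item example: 3*-1, 3*-2, 0, 3*7
ITYPE = [-1, -1, -1, -2, -2, -2, 0, 7, 7, 7]

def describe_item(item_number, itype=ITYPE):
    """Interpret one ITYPE entry (items are numbered from 1, as in FORTRAN)."""
    code = itype[item_number - 1]
    if code == 0:
        return ("criterion", None)
    if code > 0:
        return ("role", code)          # code = number of the criterion item
    return ("weighted", -code)         # code = minus the scoring weight

for n in (1, 4, 7, 8):
    print(n, describe_item(n))
```

A single integer per item thus carries three kinds of information, with the sign of the code telling them apart.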


Main Program for Comparing Groups

Subroutine INPUT is called to read in, from either punched
cards or magnetic tape, all of each subject’s data. If educational
and occupational levels are among the demographic variables 
punched on the first card, and if they are punched according to the 
Hollingshead code (Hollingshead, 1957), then the main program 
determines the subject’s social class using Hollingshead’s formula. 
It also assigns the subject to his appropriate five year age inter-
val. Finally, if the number of a criterion item was not punched
(indicating that the item was not true), but if items dependent on
it were judged in the positive direction and hence punched, then 
the main program will alter the response to the criterion item from 
false to true. 
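
The correction of criterion items described above amounts to a single pass over the punched item numbers. A minimal sketch, assuming the ITYPE convention of the ten-item example and a hypothetical function name:

```python
def repair_criteria(punched, itype):
    """Mark a criterion item true when any item that depends on it was punched."""
    repaired = set(punched)
    for item in punched:
        code = itype[item - 1]
        if code > 0:                   # positive code names the criterion item
            repaired.add(code)
    return repaired

itype = [-1, -1, -1, -2, -2, -2, 0, 7, 7, 7]
print(sorted(repair_criteria({2, 9}, itype)))
```

Here item 9 depends on criterion item 7, so item 7 is added to the subject's record even though it was not punched.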

If comparisons between different groups of subjects are desired, 
or if data from only a sample of subjects from a larger data file 
are to be analyzed, the program will either assign each subject 
to his appropriate group or reject his data and read in the next 
subject's data. The assignment of subjects to groups is made by 
subroutine CATEGY, utilizing logical tests. Group membership 
therefore may be defined by the presence or absence of any number 
of demographic characteristics. It is thus possible to partition a 
sample into such groups as married men over the age of 50 without 
an organic brain disorder; married men over the age of 50 with an 
organic brain disorder; etc. Since the groups to be compared will
usually vary from study to study, the investigator will have to
have the program for subroutine CATEGY written and com-
piled. A maximum of nine groups can be handled.
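
A grouping rule of the kind CATEGY applies might look as follows for the married-men example. This is only a sketch: the field names, their coding, and the convention of returning 0 to reject a subject are all hypothetical.

```python
def categy(subject):
    """Return a group number 1-9, or 0 to reject the subject.

    The field names and the coding of each field are hypothetical.
    """
    if not (subject["sex"] == "M" and subject["married"] and subject["age"] > 50):
        return 0                                  # not in the study sample
    return 2 if subject["organic_brain_disorder"] else 1

man = {"sex": "M", "married": True, "age": 63, "organic_brain_disorder": False}
print(categy(man))
```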

Depending on the analyses desired, one or more of the following 
subroutines may be called. If only a summary description of a
sample is desired, the subroutines will skip over all group com-
parisons.


Comparisons on Demographic Data 


Subroutine DEMO compares the groups with respect to distribu- 
tions across the classes of each demographic variable punched on 
the first data card and of each criterion item, computes the overall 
chi square and contingency coefficient (Peters and Van Voorhis, 
1940) for each demographic variable, and, if desired, computes 
the chi square and contingency coefficient for each pair of groups. 
Printed output consists, for each demographic variable, of thirty 
character descriptions of the variable itself and of each of its 
classes, the raw and relative frequencies of all classes within each 
group, and the summary statistics (see Figure 1). 
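
The two summary statistics can be illustrated with the education counts that appear in Figure 1. The sketch below computes the Pearson chi square and the contingency coefficient C = sqrt(chi square / (chi square + N)); dropping classes with no cases before computing is an assumption about how the program treats empty classes, and the function name is hypothetical.

```python
from math import sqrt

def chi_square_and_c(table):
    """Pearson chi square, contingency coefficient, and df for a
    classes x groups table of frequencies.

    Classes with no cases are dropped first (an assumption).
    """
    rows = [r for r in table if sum(r) > 0]
    col_totals = [sum(col) for col in zip(*rows)]
    n = sum(col_totals)
    chi2 = 0.0
    for row in rows:
        row_total = sum(row)
        for observed, col_total in zip(row, col_totals):
            expected = row_total * col_total / n
            if expected > 0:
                chi2 += (observed - expected) ** 2 / expected
    c = sqrt(chi2 / (chi2 + n))
    df = (len(rows) - 1) * (len(col_totals) - 1)
    return chi2, c, df

# Education counts for the two groups, coded classes only
education = [[0, 0], [0, 0], [2, 5], [6, 11], [8, 13], [16, 15], [11, 12]]
chi2, c, df = chi_square_and_c(education)
print(round(chi2, 2), round(c, 3), df)
```

With these counts the sketch agrees with the printed chi square of 2.36 and contingency coefficient of 0.152 on 4 degrees of freedom.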


Comparisons on Items 


Subroutine ITEM compares the groups with respect to the
percentages responding positively to each non-criterion item. Over-
all chi squares and contingency coefficients are computed for each
item, and, if desired, chi squares and phi coefficients for each pair
of groups. Printed output consists, for each item, of a thirty char-
acter description of the item, both the number and per cent of each
group responding positively to the item, and the summary statis-
tics (see Figure 2).


BIOMETRICS RESEARCH      MENTAL STATUS SCHEDULE      BOWERY STUDY

                                                             1        2
                                                TOTAL     NON-    ALCO-      NOT
                                                GROUP   ALCHLC    HOLIC    CODED
                                                                               0

DEMOGRAPHIC VARIABLE     TOTAL NO. OF CASES =     100       43       57
16 EDUCATION
   1 PROFESSIONAL                   NUMBER  =       0        0        0        0
                                    PERCENT =    0.00     0.00     0.00     0.00
   2 COLLEGE GRADUATE-FOUR YEARS    NUMBER  =       0        0        0        0
                                    PERCENT =    0.00     0.00     0.00     0.00
   3 1-3 YEARS COLL.,BUS.SCHL.      NUMBER  =       7        2        5        0
                                    PERCENT =    7.00     4.65     8.77     0.00
   4 HIGH SCHOOL GRADUATE           NUMBER  =      17        6       11        0
                                    PERCENT =   17.00    13.95    19.30     0.00
   5 10-11 YEARS OF SCHOOL          NUMBER  =      21        8       13        0
                                    PERCENT =   21.00    18.60    22.81     0.00
   6 7-9 YEARS OF SCHOOL            NUMBER  =      31       16       15        0
                                    PERCENT =   31.00    37.21    26.32     0.00
   7 UNDER 7 YEARS OF SCHOOL        NUMBER  =      23       11       12        0
                                    PERCENT =   23.00    25.58    21.05     0.00
   8 NOT CODED                      NUMBER  =       1        0        1        0
                                    PERCENT =    1.00     0.00     1.75     0.00

                         ALL CODED CATEGORIES    PAIR 1,2
CHI SQUARE                       2.36              2.36
CONTINGENCY COEFF.               0.152             0.15
DEGREES OF FREEDOM               4                 4

Figure 1. Example of Printout from Subroutine DEMO


Comparisons on Scales 


Subroutine SCALE calculates for each subject his scores on all 
the non-role scales, employing the weights given in the ITYPE 
array; his scores on all the role scales for which the associated 
criterion items were true; and his total score (i.e., the weighted
sum of all non-criterion items judged in the positive direction).
If desired, it gives punched and/or printed output of all his
scores. After all the data have been read in, it compares the groups
with respect to the mean scores on each scale, computing an
overall F and epsilon (Cohen, 1965) and, if desired, t’s and
epsilons (i.e., point biserial r’s) for each pair of groups. Printed
output consists, for each scale, of a thirty character description 
of the scale, the mean and standard deviation of each group, and 
the summary statistics (see Figure 3). 
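
The total score defined above can be sketched directly from the ITYPE array. The example reuses the ten-item ITYPE values given earlier; the function name is hypothetical, and giving role-scale items unit weight follows the stated rule that those items may not be differentially weighted.

```python
def total_score(punched, itype):
    """Weighted sum of all non-criterion items judged in the positive direction.

    Negative ITYPE codes carry minus the weight; role-scale items
    (positive codes) count with unit weight; criterion items add nothing.
    """
    score = 0
    for item in punched:
        code = itype[item - 1]
        if code < 0:
            score += -code
        elif code > 0:
            score += 1
    return score

itype = [-1, -1, -1, -2, -2, -2, 0, 7, 7, 7]
print(total_score({1, 4, 7, 9}, itype))
```

Item 1 contributes 1, item 4 contributes 2, criterion item 7 contributes nothing, and role item 9 contributes 1.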

The number of control cards (giving descriptions of all demo-
graphic variables, items and scales, classes within demographic
variables, options as to card or tape input and pair-wise com-
parisons, etc.) is sizeable if all three computing subroutines are
to be called in one run. To permit a check on them, all control
cards must be numbered in columns 78-80. If the main program
finds that a required card is missing or out of order, or that an 
array size has been exceeded, it prints a message describing the 
error and stops. 


DATA PROCESSING OF MACHINE SCORED
QUESTIONNAIRES 


PAUL LANE WUEBBEN 
GRETCHEN B. TIMMERMANS 
AND 
PERRY R. TIMMERMANS 
University of California, Santa Barbara 


THIS paper describes a program for conversion of output from
the IBM 1230 scoring machine into the punch format required for 
further data processing. The IBM 1230 document No. 505, the 
answer sheet used for scoring by the IBM 1230, contains spaces 
for up to 150 multiple choice (5-alternative) responses. IBM an- 
swer sheets may be profitably used in collecting subjects’ responses 
to many personality scales or questionnaire items. The IBM 1230 
scoring machine reads responses directly from the IBM answer sheet.

Output of the IBM 1230 is in two forms: (a) the summary score 
for a set of items is printed along one of the margins of the answer 
sheet, and (b) a card which records the subjects’ response to each 
item is punched. These punched cards, however, cannot be used by 
standard IBM data processing equipment since the IBM 1230 scor- 
ing machine compresses, or “codes,” data which would normally 
require two card columns into one card column. Thus direct utili- 
zation of these cards for most kinds of further analyses, such as 
factor analysis or item analysis, is precluded. 


Program 


The program is a subroutine which is suitable for insertion into 
any main program designed to yield summary scores for any 
subset of the 150 items of the answer sheets. The subroutine is 
written in FORTRAN II-D for the IBM 1620 computer with 40K 
memory and disk. This subroutine will convert any “coded” data
cards from the IBM 1230 scoring machine into standard punch 
format. 

The input consists of punched output from the IBM 1230 pre-
ceded by two parameter cards. (a) The first parameter card states
the total number of columns to be interpreted. (b) The second 
parameter card states the number of blank or empty “words,” or 
half-columns included in the total number of columns to be in- 
terpreted. The subroutine reads a data card, interprets the data, 
and then returns to the main program. After the subroutine re- 
turns control to the main program, the interpreted data are im- 
mediately available for further analyses. 

The utility of the subroutine may be illustrated by its function 
in a primary program which computes scores of two widely used 
personality scales: the Crowne-Marlowe (1960) scale of so- 
cial desirability and the Alpert-Haber (1960) test anxiety scale. 
The combined output of this primary program and the subroutine 
is a set of two punched cards for each subject; the first card is the 
interpreted raw data card and the second is the subject’s com- 
puted scores for the social desirability and test anxiety scales and 
subscales.

A printed copy of the subroutine as it appears in the primary 
program for scoring the Crowne-Marlowe social desirability scale 
and the Alpert-Haber test anxiety scale is available upon request 
from the primary author. 


REFERENCES 


Alpert, R. and Haber, R. N. Anxiety in Academic Achievement 
Situations. Journal of Abnormal and Social Psychology, 1960, 
61, 207-215. 

Crowne, D. P. and Marlowe, D. A New Scale of Social Desirability 
Independent of Psychopathology. Journal of Consulting Psy- 
chology, 1960, 24, 349-354. 


EDUCATIONAL AND PSYCHOLOGICAL MEASUREMENT 
1967, 27, 197-199. 


AN IBM SCORING ROUTINE FOR THREE-CHOICE 
INVENTORIES 


ROBERT A. JONES, CALVIN M. PULLIAS 


University of Southern California 
AND 
WILLIAM B. MICHAEL 


University of California, Santa Barbara 


THE relatively wide distribution of small scale computers has
resulted in the increased availability of low cost computer time
for measurement research. The difficulty of scoring weighted
choice inventories has generally tended to discourage the devel- 
opment of such inventories. The thought that the availability of a 
quick scoring routine could encourage research into the develop- 
ment of inventories or additional scales for existing inventories 
led the authors to the development of a scoring routine. 

The utilization of optical scanning equipment which could 
sharply reduce the clerical labor required to convert response data 
to punched cards seemed desirable. For these reasons, a set of
answer sheets was designed for the IBM 1230 Optical Scanner, 
and a computer routine to utilize the 1230 card output was writ- 
ten for an IBM 1401 (8K, special features, one tape drive). The 
routine, which is written in SPS (SOPAT version), completes
scoring and printing for each candidate at the rate of approxi- 
mately one second per scale. 


Characteristics of the Program 


Scope 


The routine permits an inventory to contain a maximum of 
400 three-choice items. Each response for an item is individually 
weighted. Each weight must be an integer between -9 and +9
inclusive. The number of scales is relatively unlimited. Modifica- 
tion of the routine to permit more items, more alternatives, or 
greater weights would usually not require major reprogramming. 


Subject Data 


Assume the inventory to have 400 three-choice items. Seven in-
put cards would be required for each candidate. The first card,
which is keypunched manually, contains the candidate’s name or
other label if desired (columns 1-28), a unique subject ID (col-
umns 76-79), and card ID (zero in column 80). The other six
cards, except the last, each contain 75 responses, 4 digits of sub-
ject ID, and a digit of card ID. The last card contains, of course,
only 25 responses. Sets of cards may be stacked together, but an 
error in deck set-up which results in a septuple not all having the 
same subject ID or any inversion in card order will result in the 
termination of the scoring run. 
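
The deck check described above can be sketched as follows. The routine itself is written in SPS; this Python version is only an illustration, and the assumption that the seven cards carry card IDs 0 through 6 in column 80 goes beyond the text, which states only that the first card carries a zero.

```python
def check_septuple(cards):
    """Verify that seven 80-column cards share one subject ID and are in order.

    Card IDs 0-6 in column 80 are assumed.
    """
    if len(cards) != 7:
        return False
    subject_id = cards[0][75:79]                 # columns 76-79
    for position, card in enumerate(cards):
        if card[75:79] != subject_id or card[79] != str(position):
            return False
    return True

deck = ["X".ljust(75) + "0042" + str(i) for i in range(7)]
print(check_septuple(deck))
```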


Keys


A key for a single scale on a 400 item three-choice inventory re- 
quires 1200 weights. By punching the 75 weights for 25 items 
into columns 1 through 75 of a single card, a complete key includ- 
ing a label card can be contained in 17 cards. (In order to ac- 
commodate both the value and sign of a negative weight in a single
card column, eleven-zone overpunching is utilized; i.e., “-2” is
punched as “K”.) Since some inventories have a large number of
scales, a complete deck of keys can be quite large. For this reason a 
small program was written to load the keys on a magnetic tape 
which may be stored and mounted as needed. If separate keys are 
available for men and women, they are stored in separate files on 
the tape. While scoring could proceed without the use of tape, the 
process would require repeated passes over the key deck and would 
consume much more computer time. 
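
The single-column coding of signed weights can be decoded with a small table. In the standard card code an eleven-zone punch over the digits 1 through 9 reads as the letters J through R, which is consistent with the “K” example above; the function names below are hypothetical, and the sketch is in Python rather than SPS.

```python
# 11-zone overpunches over digits 1-9 read as the letters J-R.
NEGATIVE = {letter: -value for value, letter in enumerate("JKLMNOPQR", start=1)}

def decode_weight(column):
    """Decode one card column into an integer weight between -9 and +9."""
    if column.isdigit():
        return int(column)
    return NEGATIVE[column]

def decode_key_card(card):
    """Return 25 triads of weights from columns 1-75 of a key card."""
    weights = [decode_weight(ch) for ch in card[:75]]
    return [tuple(weights[i:i + 3]) for i in range(0, 75, 3)]

print(decode_weight("K"), decode_key_card("1K0" * 25)[0])
```

Sixteen such cards of 25 triads each cover the 400 items, which with the label card gives the 17 cards per key mentioned in the text.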


Method of Scoring 


All 400 responses of one person are stored as an array. The key 
read from tape contains three one-digit numbers for each question.
According as the response to a given question is 1, 2, or 3, the 
first, second, or third number of the triad corresponding to that 
question is added (algebraically) to the raw score. After all 400
questions have been scored, a standard score is also computed from 
the mean and standard deviation recorded on the key label card. 
The key label, raw score, and standard score are printed, and the 
next tape record is read to initiate scoring on the next scale. When 
a tape mark is reached, the tape is rewound and the data cards for
another person are read. The first tape file is skipped if there is
an “F” in column 71 of the name/label card.
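
The scoring rule for one scale can be sketched in a few lines. The sketch is in Python rather than SPS, and the z-score form of the standard score is an assumption, since the text says only that the standard score is computed from the mean and standard deviation on the key label card.

```python
def score_scale(responses, key, mean, sd):
    """Raw and standard score for one scale.

    responses: one value per item, each 1, 2, or 3
    key:       one (w1, w2, w3) triad of weights per item
    The z-score form of the standard score is assumed.
    """
    raw = sum(triad[r - 1] for r, triad in zip(responses, key))
    standard = (raw - mean) / sd
    return raw, standard

responses = [1, 3, 2, 2]
key = [(2, 0, -1), (0, 1, 3), (-2, 4, 0), (1, 1, 1)]
print(score_scale(responses, key, mean=5.0, sd=2.0))
```

Holding the responses in memory and streaming one key at a time from tape is what allows the number of scales to be essentially unlimited.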


Output 

Appropriate headings may be printed at the top of the page with 
the name/label printed from the first data card beneath the head- 
ings. Below are listed the scales, raw scores, and standard scores. 
If an error in deck set-up is detected, an appropriate statement is 
printed. Depending on the nature of the error detected, execution 
either is halted or skips to the next person. Output for each person 
appears on a separate sheet. 


Availability

A listing of the program will be supplied free of charge. How-
ever, source decks will be supplied at a nominal charge to cover
postage and handling.


A FORTRAN IV PROGRAM FOR EVALUATING
INTERNAL CONSISTENCY AND SINGLE-
FACTOREDNESS IN SETS OF MULTILEVEL ATTITUDE
ITEMS¹


DONALD G. MORRISON AND DONALD T. CAMPBELL


Northwestern University 
AND 
LEROY WOLINS 


Iowa State University 


HANDLING data from attitude scales up to 47 items long, ad-
ministered to an unlimited number of respondents, for whom each
item response is coded within one column (e.g., values 1 to 5), 
the program computes split-half and Kuder-Richardson (Coeffici- 
ent Alpha) reliabilities, interitem correlations, corrected item-total 
correlations, average interitem coefficients, response-set analysis, 
single-factor extraction with Lawley’s Test of significance, plus 
an index of single-factoredness. A manual provides descriptive 
background and citation as well as instructions for program use. 
Instructions for conversion to CDC Fortran are included. The pro- 
gram is distributed by the Vogelback Computing Center, North- 
western University, Evanston, Illinois, 60201, as NUCO048 
ATTANAL. 
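
As an illustration of two of the statistics named above, the following Python sketch computes coefficient alpha and corrected item-total correlations for a small respondents-by-items matrix. It is not the ATTANAL source, which is written in Fortran IV; the function names, the use of NumPy, and the response data are assumptions made for the example.

```python
import numpy as np

def coefficient_alpha(scores):
    """Coefficient alpha for a respondents-by-items score matrix."""
    scores = np.asarray(scores, dtype=float)
    k = scores.shape[1]
    item_variances = scores.var(axis=0, ddof=1)
    total_variance = scores.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1.0 - item_variances.sum() / total_variance)

def corrected_item_total(scores):
    """Correlate each item with the total of the remaining items."""
    scores = np.asarray(scores, dtype=float)
    totals = scores.sum(axis=1)
    return np.array([
        np.corrcoef(scores[:, j], totals - scores[:, j])[0, 1]
        for j in range(scores.shape[1])
    ])

# Hypothetical responses: 6 respondents, 4 items coded 1 to 5.
responses = np.array([
    [1, 2, 1, 2],
    [2, 2, 3, 2],
    [3, 3, 3, 4],
    [4, 3, 4, 4],
    [4, 5, 4, 5],
    [5, 5, 5, 4],
])

alpha = coefficient_alpha(responses)
item_rest = corrected_item_total(responses)
print(round(alpha, 3), np.round(item_rest, 3))
```

Removing each item from the total before correlating is what makes the item-total correlation "corrected": the item no longer correlates with itself. For items scored 0 or 1, the same alpha computation gives the Kuder-Richardson Formula 20 value.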


————— 
1 Supported in part by Project C-998, Contract 3-20-001, U. S. Office of 
Education. 
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A FORTRAN PROGRAM FOR GUTTMAN AND 
OTHER SCALOGRAM ANALYSES¹ 


ROLAND WERNER 
Syracuse University 


This program computes the Guttman reproducibility coefficient, 
a scalogram presentation, and modifications suggested by Good- 
man, Sagi, Green, Borgatta, Menzel, and Schuessler. In addition 
it computes from the same data Loevinger’s Homogeneity index 
and Kuder-Richardson reliability coefficient. The 85-page manual 
contains background explanation, bibliography, operating instruc- 
tions, a Fortran program listing, and sample output. Manual and 
program are available from The Systems Research Committee, 
216 Ostrom Ave., Syracuse, New York, 13210. 
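
The reproducibility coefficient can be sketched in Python as follows. The error count used here (deviations from the ideal pattern implied by each respondent's total score, a convention often associated with Goodenough and Edwards) is an assumption; the program described above may count errors differently. The function name and the data are hypothetical.

```python
import numpy as np

def reproducibility(responses):
    """Coefficient of reproducibility, counting deviations from ideal patterns."""
    data = np.asarray(responses, dtype=int)
    n_persons, n_items = data.shape
    # Order items from most to least frequently endorsed.
    order = np.argsort(-data.sum(axis=0), kind="stable")
    ordered = data[:, order]
    scores = ordered.sum(axis=1)
    # Ideal pattern for a score s: endorse the s most popular items only.
    ideal = (np.arange(n_items)[None, :] < scores[:, None]).astype(int)
    errors = int((ordered != ideal).sum())
    return 1.0 - errors / (n_persons * n_items)

# Hypothetical data: 4 respondents by 3 dichotomous items.
perfect = np.array([[1, 1, 1], [1, 1, 0], [1, 0, 0], [0, 0, 0]])
with_error = np.array([[1, 1, 1], [1, 1, 0], [1, 0, 0], [0, 1, 0]])

rep_perfect = reproducibility(perfect)
rep_with_error = reproducibility(with_error)
print(rep_perfect, round(rep_with_error, 3))
```

In the second matrix the last respondent endorses the second item but not the first, which counts as two deviations from the ideal pattern for a total score of one, so the coefficient falls below unity.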


1 Supported in part by Project C-998, Contract 3-20-001, U. S. Office of 
Education. 
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The Challenge of Response Sets by Jack Block. New York: Ap- 
pleton-Century-Crofts, 1965. Pp. x + 142. $5.00. 


It is entirely in keeping with the spirit of free scientific inquiry 
and debate that criticism be directed at previous findings or con- 
clusions. If a member of the scientific community feels that others 
are getting ahead of him, it is the critic’s right, even his duty, to 
call them up short, to state the bases for the disagreement, to 
provide alternative analyses and interpretations—in short, to ex- 
ercise the scientific privilege of saying “no.” Often such debate 
stirs the interest of the scientific community into further research 
to resolve thorny questions. As in a court of law, the scientific com- 
munity weighs the evidence and reaches a verdict. Unlike the 
court of law, scientists may seek fresh data, meanwhile suspending 
judgment for months or even for years. 

Jack Block has exercised the privilege of challenging the findings 
linking the MMPI to response sets. His stated purpose (p. 2) is 
“to contest the interpretation that ‘acquiescence, as moderated by 
social desirability, plays a dominant role in personality inven- 
tories like the MMPI’ (Messick and Jackson, 1961).” In an at- 
tempt to buttress this contention, Block has devoted much time 
and effort, engaging in a variety of computational stratagems, and 
in lengthy, even (by his own observation) tedious arguments. 

The main objects of Block’s criticisms are a series of studies 
by Jackson and Messick and a second series of studies by Edwards 
and his colleagues into the internal structure of the MMPI. In 
short, Jackson and Messick in a series of factor analytic investi- 
gations of the MMPI on groups of prison inmates, college students, 
and psychiatric patients, found two very large factors. One of 
these completely separated true from false keyed MMPI sub- 
scales, and the other yielded factor loadings substantially corre- 
lated with independent average judged scale values of desirability. 
Scales constructed by randomly selecting items at particular levels 
of desirability loaded factors consistently in a manner predictable 
from their desirability scale values and their keying which was 
entirely in the true direction. Research by Edwards and his stu- 
dents similarly has been directed at a number of variations on the 
theme that his SD scale is predictive of a major portion of MMPI 
scale variance. While each of these sets of investigators has stud- 
ied both the tendency to agree and to respond desirably over di- 
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verse content, Block adopts the strategy of focusing his attack in 
the early part of the book upon the Jackson and Messick studies 
for their conclusions about acquiescence, later turning to challenge 
Edwards’ interpretations of social desirability. 

Block's major arguments may be summarized as follows: (a) 
“Pure” acquiescence measures of inventory items are difficult to 
construct, certain of these admitting of a content interpretation, 
(b) factors emerging before and after purported “elimination” of 
acquiescence variance are reported to be similar, (c) the social de- 
sirability (SD) dimension and certain characterological dimen- 
sions are colinear on the MMPI but the SD hypothesis is an “heur- 
istic failure” in other behavioral domains, (d) the possibility of 
constructing a “subtle SD scale” with items in the neutral range 
of desirability is interpreted as demonstrating that the SD no- 
tion is related only “adventitiously” to a personality dimension of 
fundamental behavioral significance, (e) the first two factors of 
the MMPI have correlates in independent ratings of personality, 
and (f) psychometrically-oriented investigators have been una- 
ware of the “clinical lore” surrounding the MMPI. Thus, Block 
concludes (p. 119) that the “analyses reported in this monograph 
support rather well the MMPI as initially conceived and as tra- 
ditionally employed.” 

The quasi-technical nature of some of these analyses and their 
circuitous logical development require careful evaluation. Those 
predisposed to the traditional conception of the MMPI will per- 
haps be happy to embrace uncritically Block’s conclusions on the 
authority of his reputation and the book’s receipt of the Century 
Psychology Series Award of 1964. Even those who were con- 
vinced by the earlier response set research may be impressed by 
Block’s strong advocacy of the MMPI, and by his energy in at- 
tacking what appear to be faults in the earlier research. But this 
is not a book to be read uncritically. Certainly, Block’s stated 
position in the Preface of being “squarely, and with little com- 
promise” against the response set position, might caution the reader 
to be wary of special pleading. A critical reading of the book will 
frequently reward the reader with the positive affect that accom- 
panies the identification of logical error. Since these errors are 
quite specific to each analysis, they must be uncovered one by one. 

In a chapter entitled “A Negative View of Acquiescence Re- 
search” Block reviews the factor analytic studies of Jackson and 
Messick (1961, 1962), noting that they were able consistently to 
separate true and false keyed subscales on a factor labeled acqui- 
escence. Under the heading “Artifact and Content in the Analysis 
of ‘True’ and ‘False’ MMPI Subscales,” Block states: 

Unfortunately, the analytical design used by Jackson and Messick—while at 
first glance appealing—has the bad luck [illegible in source] 
presence of item overlap among MMPI scales. Although it is well known 
that such item overlap exists, its extent and effect is, as a rule, not recognized. 
In the Jackson and Messick design, the effect of item overlap can be mis- 
leading. 


Block then devotes several pages to a discussion of how item over- 
lap may affect factor structure. 

Actually, the implication that Jackson and Messick were una- 
ware of the problem of item overlap or took no steps to con- 
trol for this is entirely false and misleading. Nowhere in this 
chapter does Block mention that Jackson and Messick in con- 
structing their special response style marker scales, systemat- 
ically minimized item overlap with MMPI clinical scales. Nor does 
he even hint that they sought to render the effects of item overlap 
harmless by rotating item overlap factors to a position orthogonal 
to response style and to content factors, carefully identifying fac- 
tors of item overlap by comparing the residual correlations repro- 
duced by their loadings with the correlation expected by the Mc- 
Nemar “common elements” correlation formula. Instead, Block 
proceeds to use this same formula to re-calculate these item over- 
lap correlations. He then factor analyzes these correlations in an 
attempt to demonstrate that the two large MMPI response style 
factors found by Jackson and Messick “are susceptible to an al- 
ternative explanation.” As Messick (1965) has recently demon- 
strated, Block’s factor analysis of item overlap correlations, when 
properly rotated, yields two factors which conform almost per- 
fectly to two of the item overlap factors identified and rotated to a 
neutral position by Jackson and Messick. 
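
For readers unfamiliar with the common elements formula, a minimal Python sketch follows. It assumes the usual textbook form, in which items are taken to be uncorrelated and equally variable, so that the correlation expected from overlap alone is the number of shared items divided by the geometric mean of the two scale lengths. The function name and the item counts are hypothetical and are not taken from the Jackson and Messick or the Block analyses.

```python
from math import sqrt

def common_elements_r(n_common, n_unique_a, n_unique_b):
    """Correlation expected from shared items alone.

    Assumes uncorrelated, equally variable items that are keyed in the
    same direction on both scales.
    """
    length_a = n_common + n_unique_a
    length_b = n_common + n_unique_b
    return n_common / sqrt(length_a * length_b)

# Hypothetical scales of 40 and 30 items sharing 10 items.
overlap_r = common_elements_r(n_common=10, n_unique_a=30, n_unique_b=20)
print(round(overlap_r, 3))
```

Under the same assumptions, a shared item keyed in opposite directions on the two scales would contribute negatively rather than positively to the numerator.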

Block implies that he has “discovered” a failure to control for 
item overlap by Jackson and Messick. But even a casual reading 
of the Jackson and Messick studies reveals that they not only 
were well aware of the problem, but utilized a carefully devised 
psychometric procedure to overcome the difficulty. What sort of 
evaluation can one place upon scientific writing that might seri- 
ously mislead a reader who has not read the original articles? 
Block’s failure to identify a crucial phase of this previous research 
hardly can be considered careful or responsible scientific reporting. 
Nor does Block’s critical concern about item overlap in the Jack- 
son and Messick studies extend to his own work. In his own at- 
tempts to factor analyze the MMPI reported in the monograph, 
Block fails to introduce any controls for item overlap. This has 
serious consequences, as will be demonstrated. 

Block devotes considerable space in an attempted refutation of 
a response set interpretation of the Welsh "pure factor" MMPI 
scales, which were developed empirically to represent the first two 
factors of the MMPI. The Welsh A scale contains 38 of 39 items 
keyed true, while the R scale contains 40 items, all keyed false. 
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Block quotes Messick and Jackson, who noted that the negative 
correlation between the A and R scale was not sufficiently high to 
interpret the results entirely in terms of a response set of general- 
ized acquiescence. Messick and Jackson suggested that the cor- 
relation could be accounted for by invoking a second dimension, 
the stylistic tendency to respond desirably. Block states (p. 23) 
“Although Messick and Jackson place much emphasis upon this 
conjectural possibility, they have not examined it empirically.” 
Block is incorrect. He has overlooked the fact that the A and R 
scales were included in the three published factor analytic studies, 
in which acquiescence and desirability factors were defined objec- 
tively by reference to randomly selected true keyed items at par- 
ticular levels of desirability. Block contends that since the A scale 
contains extremely undesirable items, and the R scale items tend 
to be intermediate with respect to social desirability, “clearly then, 
social desirability as the second primary factor within the MMPI 
cannot be invoked to explain away the orthogonality of A and R.” 
But the factor analytic results consistently indicate a moderately 
high loading for the A scale on the dimension identified as ac- 
quiescence by Jackson and Messick, with the R scale highly neg- 
atively loaded on this same dimension. The loadings on the di- 
mension identified as desirability by Jackson and Messick for the 
A and R scales are almost precisely what would be predicted from 
their respective desirability scale values. Furthermore, the cosine 
of the angle between the vectors determined by A and R on these 
factors almost exactly corresponds to their correlation. This would 
indicate that two factors such as desirability and acquiescence can 
very well account for the correlation between the A and R scales. 
Incidentally, it was convenient for Block’s argument here to con- 
sider the A and R scales orthogonal to one another. Actually, 
this is usually not the case. A correlation in the negative direction 
is most common. In Block’s own five samples, the median correla- 
tion is -.24, a correlation consistent with the median cosine of 
the angle separating the A and R vectors on the first two factors 
of the Jackson and Messick studies of -.39. 
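
The relation between factor loadings, the cosine of the angle between two scale vectors, and the correlation reproduced by two orthogonal factors can be sketched in Python. The loadings below are invented for illustration and are not values from any of the studies under discussion; the variable names are likewise hypothetical.

```python
import numpy as np

# Hypothetical loadings on (acquiescence, desirability); not from any study.
loadings_a = np.array([0.45, -0.75])
loadings_r = np.array([-0.60, -0.15])

# Correlation reproduced by two orthogonal factors: sum of loading products.
reproduced_r = float(loadings_a @ loadings_r)

# Cosine of the angle between the two vectors in the factor plane.
cosine = reproduced_r / float(
    np.linalg.norm(loadings_a) * np.linalg.norm(loadings_r)
)
print(round(reproduced_r, 4), round(cosine, 4))
```

The cosine equals the reproduced correlation divided by the product of the two vector lengths, so the two quantities agree closely only when most of each scale's variance lies in the plane of the two factors.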

Block turns next to a factor analysis of the Jackson and Mes- 
sick Dy3 scale. This scale consists of 60 true keyed items se- 
lected randomly from items in the neutral range of judged desira- 
bility, with the restriction that item overlap between this scale and 
MMPI clinical scales be limited to not more than six items. This 
scale in all three samples had the highest loading on the factor 
labeled acquiescence by Jackson and Messick. Block’s reasoning 
in undertaking this factor analysis was to seek to find some in- 
terpretation attributable to content. If such could be found Block 
reasoned that the reliability reported for the scale and the factor 
labeled acquiescence might be reinterpreted in terms of content. 
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Block proceeded with his factor analysis of items, using phi co- 
efficients, and a principal axis solution without rotation to simple 
structure. Block (p. 26) states: “Only the first three factors ex- 
tract large and interpretable portions of the variance, while the 
remaining factors encompass little variance and are insufficiently 
defined.” In spite of the fact that all but one of the nineteen items 
loading the first factor are keyed true (the exception is a negation), 
Block interprets this first and largest factor as a measure of under- 
control and impulsivity, and hence it follows, according to Block, 
that all other MMPI scales loading the acquiescence factor can 
now be reinterpreted in terms of under-control. 

This line of reasoning, when examined carefully, can be faulted 
on a number of counts: (a) Block has chosen to single out Dy3 to 
seek a content interpretation of response consistency to this scale, 
and thereby to generalize these findings to all scales loading the 
factor labeled acquiescence. But this acquiescence factor is a gen- 
eral factor implicating virtually all MMPI scales. The factor pat- 
tern separating true and false keyed subscales would change lit- 
tle, if any, if Dy3 were omitted. Would Block expect a similar 
“under-control” factor to emerge in separate factor analyses of 
items in each of these scales? Comrey’s factor analyses of MMPI 
items would not support such an expectation. (b) The technical 
aspects of Block’s analyses and reporting leave something to be 
desired. Block states that if one expected an acquiescence inter- 
pretation, one of two alternative factor structures should emerge 
in analyses of items: either a large general acquiescence factor; 
or a large number of slightly oblique factors, with acquiescence 
emerging at the second order. But nowhere does Block provide the 
data to appraise the evidence in terms of the proposed alternatives. 
Surprisingly, no data on the relative sizes of latent roots or num- 
ber of significant factors are reported. Furthermore, his failure to 
rotate leads inevitably to an imbalance in the variance attribut- 
able to the various factors. But, in any event, the reader must 
accept on faith Block’s decision to interpret only three factors and 
his conclusion that these results do not meet expectations from an 
acquiescence hypothesis. How prudent is it to invest one’s faith 
in Block’s interpretation? This is a matter for conjecture.* Given 
the relative lack of invariance of unrotated factor patterns de- 
rived from heterogeneous items, and the availability of several 
samples, Block's failure to cross-validate or replicate this factor 

*Professor Block has kindly provided the reviewer with a copy of the 
complete factor matrix for his analysis of the Dy3 scale. Using the highest 
column phi coefficient in the diagonal, Block obtained ten factors with latent 
roots in excess of unity. If unity had been placed in the diagonal, it is esti- 
mated that at least eighteen factors meeting Kaiser's criterion of yielding 
latent roots in excess of unity would have appeared. 
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analysis is a serious criticism. This is particularly so when one 
notes that one of two negative impulsivity items loads Block's first 
factor positively. Neither does Block report any data bearing on 
the obliqueness of factors or the emergence of an acquiescence fac- 
tor at the second order, and (c) even if a content factor such as 
impulsivity was associated with the Dy3 scale, this would not be 
inconsistent with an acquiescence hypothesis. Accumulating re- 
search has indicated some correlation between impulsivity and the 
species of acquiescence found on personality questionnaires. Thus, 
items reflecting both impulsivity and acquiescence should be most 
heavily saturated on an acquiescence factor, an expectation con- 
sistent with Block’s report of his findings. But it is an error in logic 
to conclude, as does Block, that the possibility of content variance 
necessarily excludes acquiescence variance, or that slightly cor- 
related tendencies are identical. In the light of these considerations 
one remains unconvinced that Dy3, or other of the scales prom- 
inently loading the second largest MMPI factor is proved to be 
innocent in respect to acquiescence. 
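
The count of latent roots in excess of unity mentioned in the footnote can be illustrated with a short Python sketch. It computes phi coefficients as ordinary product-moment correlations among dichotomous items, leaves unities in the diagonal, and counts the latent roots greater than one. The data are hypothetical, and the function name and the use of NumPy are assumptions; this is not a reconstruction of Block's analysis, which placed the highest column phi coefficient in the diagonal.

```python
import numpy as np

def roots_above_unity(items):
    """Count latent roots greater than one in the item phi matrix."""
    items = np.asarray(items, dtype=float)
    # The Pearson correlation of two 0/1 items is their phi coefficient.
    phi = np.corrcoef(items, rowvar=False)
    # Unities are left in the diagonal of the correlation matrix.
    roots = np.linalg.eigvalsh(phi)
    return int((roots > 1.0).sum()), roots[::-1]

# Hypothetical responses: 8 respondents by 4 true/false items (1 = true).
items = np.array([
    [1, 1, 1, 1],
    [1, 1, 1, 1],
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
])

n_retained, roots = roots_above_unity(items)
print(n_retained, np.round(roots, 3))
```

Because the latent roots of a correlation matrix sum to the number of items, the number of roots exceeding unity depends on what is placed in the diagonal, which is the point of the comparison made in the footnote.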

Block's next attempt to discredit the acquiescence interpretation 
on the MMPI is to attempt to construct “acquiescence-free MMPI 
scales,” and then to compare the results of refactorization of the 
MMPI with such scales with the results of a factor analysis with 
original scales. Block attempts to show that the factor structure 
emerging with “acquiescence-balanced” MMPI scales is similar 
to that emerging from analysis of original scales. With acquies- 
cence variance “removed,” according to Block, it is “indisputably 
clear” that the factor structure of the MMPI “does not change.” 
Unfortunately, the rationale for this procedure rests entirely on the 
adequacy of Block’s attempt to balance acquiescence on MMPI 
scales by balancing the number of true and false items on each 
scale. In spite of his lengthy protestations, such an attempt will 
inevitably fail with MMPI items. There are three important rea- 
sons why scales balanced for the number of true and false items 
might still yield scores influenced by acquiescence: (a) Items are 
not uniform in acquiescence-eliciting potential for several reasons. 
First of all, items differ in variance. An item with a p value of .5 
will have more than 25 times as much variance as an item with a 
p of .01. If acquiescence is a constant portion of the total variance, 
the first item would be unbalanced in acquiescence by a ratio of 25 
to 1. But the acquiescence-eliciting potential of items is not uni- 
form over all items even if variances are the same. Several of the 
MMPI scales contain true keyed items which are unpopular and 
grossly psychotic, while false keyed items for the same scale are 
more moderate. It is naive to assume that a simple balancing of the 
number of such items keyed true and false will always have the 
desired effect. Block might better have balanced items in terms of 
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average true score variance or in terms of their loading on an ac- 
quiescence factor. But if Block had chosen to use the latter al- 
ternative, he would have had to admit that such existed. (b) Even 
if two halves of a scale were perfectly balanced for acquiescence, 
the presence of acquiescence distorts total scores. Thus a subject 
who responds entirely in terms of acquiescence must necessarily 
fall in the middle of the possible score range of the “balanced” 
scale, but on a skewed psychopathological scale, this might be far 
in the direction of psychopathology in terms of the distribution of 
observed scores. For example, if a subject responded entirely in 
terms of acquiescence on a scale containing 30 true keyed and 30 
false keyed items for which the mean score was ten, he would ob- 
tain a score of 30, far into the psychopathological range. (c) 
The reciprocal moderating relationship between acquiescence and 
desirability complicates the problem of balancing scales for ac- 
quiescence. Undesirable traits will inevitably yield more responses 
in terms of desirability for true-keyed items than for false-keyed 
items. False-keyed items for undesirable traits are usually more 
moderate, more ambiguous, and hence elicit more acquiescence. 
Block’s protestations about the difficulty of balancing such item 
properties are not an excuse for avoiding these complicated prob- 
lems, nor for claiming that scales “balanced” for number of true 
and false keyed items, but which suffer these problems, have 
acquiescence variance “removed.” 
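
Two of the numerical points in the preceding paragraph can be checked with a few lines of Python. The sketch assumes dichotomous items, whose variance is p(1 - p), and a scale score equal to the number of responses given in the keyed direction; the function and variable names are hypothetical.

```python
def binary_item_variance(p):
    """Variance of a dichotomous item endorsed with probability p."""
    return p * (1.0 - p)

# Variance of an item with p = .5 relative to an item with p = .01.
ratio = binary_item_variance(0.5) / binary_item_variance(0.01)

def scale_score(responses, keys):
    """Count the responses that match the keyed direction."""
    return sum(r == k for r, k in zip(responses, keys))

# A "balanced" scale with 30 true keyed and 30 false keyed items.
keys = ["T"] * 30 + ["F"] * 30

# A respondent who answers "true" to every item.
all_true = ["T"] * 60
acquiescent_score = scale_score(all_true, keys)

print(round(ratio, 2), acquiescent_score)
```

The ratio is a little more than 25, and the wholly acquiescent respondent scores 30, the midpoint of the possible range, whatever the mean of the observed score distribution happens to be.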

It is curious that Block, in an attempt to buttress the rationale 
for “eliminating” acquiescence by balancing the true and false 
keying of items, engages in inaccurate reporting of the position of 
others. Thus, on page 39 he states, “Some psychologists (Green, 
1954; Jackson and Messick, 1961) project a ‘single common fac- 
tor’ view of MMPI scales. For them, the implicit mathematical 
model underlying an inventory scale presumes an equivalence or 
interchangeability of items, each item, in principle, contributing a 
small amount of common variance toward the total score.” A 
perusal of the Green Handbook of Social Psychology chapter re- 
veals no reference whatsoever to the MMPI. Green does discuss 
the single common factor model as the underlying model for the 
method of summated ratings in attitude measurement. But if Block 
considers this model a justification for balancing acquiescence by 
balancing the number of true and false keyed items, his faith is 
misplaced. Items would be equivalent only to the extent to which 
they met the requirements of this model that they each measure a 
common (content) factor, to which all of the items are mutually re- 
lated, and that other factors measured by items be uncorrelated 
across items. Block is wrong if he assumes that under such a model 
items can be considered to be equivalent in terms of variance, or in 
terms of extraneous factors such as acquiescence. The reference to 
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the Jackson and Messick article in this connection is particularly 
surprising because those authors emphasized the failure of MMPI 
scales to demonstrate homogeneity, even though they have been 
used increasingly to attempt to place respondents along latent di- 
mensions of psychopathology. 

An extended replication of Block's factor analyses conducted 
by Messick has revealed that the scales balanced for true and 
false items by Block are grossly unbalanced in terms of both their 
observed variances and their reliabilities. Furthermore, the first 
two unrotated factors removed by Block turned out to be by no 
means identical with the rotated factors interpreted by Jackson 
and Messick. Because Block failed to rotate axes, chose not to in- 
clude acquiescence marker scales, nor to separate true and false 
keyed items, his two largest factors are left in a position that is 
distinctly different from those interpreted by others. His second 
factor, upon rotation, is apportioned among three or four orthog- 
onally rotated factors and is partly anchored in its original po- 
sition by item overlap artifacts. Thus, these overlap artifacts in- 
evitably will lead to a spurious stability of loadings on Block's 
second factor. Had Block rotated factors based on separately keyed 
true and false scales as did Jackson and Messick, the item overlap 
variance would not have intruded into the determination of load- 
ings on his second factor. When the Jackson and Messick Dy 
scales and true and false keyed subscales were extended into the 
factor space by projection methods in Messick's analysis, the Dy 
scales organized themselves perfectly in terms of their predicted 
location, and true and false keyed scales were separated as in pre- 
vious analyses. It is clear that the acquiescence factor identified 
in previous studies and Block’s second factor are not colinear in 
spite of his presumption to the contrary, that his attempted dem- 
onstration of the extent of this second factor before and after ac- 
quiescence “balancing” rests on most tenuous assumptions, and 
that, in general, Block’s factor analyses and interpretations were 
technically so inadequate that inferences therefrom are irrelevant 
to the earlier work. 

Block moves next to a logical attack upon the social desirability 
hypothesis, largely as identified by Edwards and his co-workers. 
Block argues that the bulk of Edwards' evidence rests on the fact 
that his SD scale correlates highly with other MMPI scales. Block 
points out, however, that the SD scale, rather than being hetero- 
geneous as Edwards claims, actually contains content homogeneity 
[illegible in source]. According to Block, it is this con- 
tent homogeneity, not social desirability, that accounts for the high 
correlations with MMPI scales. 

It is true that on the MMPI, pathological content is generally 
represented in the undesirable direction, and it is difficult to dis- 
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tinguish psychopathology from undesirable responding, especially 
since neurotic or psychotic traits generally are themselves unde- 
sirable. But there is now much evidence (which Block has chosen 
to ignore) linking undesirability of responding to dimensions other 
than those reflecting psychopathology. However, even within the 
MMPI, Block's two-valued logic misses the point of the social 
desirability analyses. Block contends that all of the MMPI scales 
measure some extremely general dimension of content which has 
been identified as desirability, a term Block feels lacks “heuristic” 
import. He prefers a term like “ego-resiliency.” But what heur- 
istic advantage does this substitution entail? Neither Jackson and 
Messick nor Edwards have claimed that scales salient on the de- 
sirability factor necessarily lack convergent validity, as for ex- 
ample, in distinguishing normals from hospital patients. What 
they do suggest, however, is that scales whose mutual correlations 
approach their respective reliabilities cannot possess discriminant 
validity, nor does it make sense to call them by different names. 
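
The argument about discriminant validity can be made concrete with the classical correction for attenuation, in which an observed correlation is divided by the square root of the product of the two reliabilities. The Python sketch below uses invented values, not figures from the studies under review, and the function name is hypothetical.

```python
from math import sqrt

def disattenuated_r(r_xy, rel_x, rel_y):
    """Correlation between two scales corrected for unreliability."""
    return r_xy / sqrt(rel_x * rel_y)

# Hypothetical observed correlation and scale reliabilities.
corrected = disattenuated_r(r_xy=0.80, rel_x=0.85, rel_y=0.82)
print(round(corrected, 3))
```

When the corrected value approaches unity, as it does here, the two scales share nearly all of their reliable variance, which is the sense in which they cannot be said to measure different things.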

Block continues his attempt to impeach the desirability con- 
ception by seeking to identify a set of items, neutral as to social de- 
sirability scale value, but correlating highly with the first factor of 
the MMPI. According to Block, such a “desirability-free” scale is 
an “untenable anomaly” for the desirability hypothesis. Block 
proceeded to identify a set of 20 true-keyed and 20 false-keyed 
items having the desired properties. The resulting total scale (la- 
beled ER-S for Ego Resiliency-subtle) had a median correlation 
of .67 with the Edwards SD scale in nine samples. While this is 
not so high as the correlation between SD and other MMPI scales, 
it is significant, and is high in relation to the reliability of ER-S. 
One may question on logical grounds, as did Messick, the degree 
to which such evidence is an “untenable anomaly” for the desir- 
ability hypothesis. The search for subtle scales in MMPI test con- 
struction has a long history. Surely, the identification of reliable 
subtle psychopathological items would strengthen rather than 
weaken a construct such as psychopathic personality. By relying 
solely on the mechanical application of rules to select items cor- 
relating highly with the first factor while neutral in desirability, 
Block practically guarantees that the scale so derived will have 
little substantive or theoretical import. As it turns out, the scale 
is a curious anomaly, but by no means untenable from the van- 
tage point of the desirability hypothesis. 

Block has relied on the Messick-Jackson desirability scale values 
in constructing the ER-S scale. These values were based on the 
judged desirability of a true response, and for most purposes have 
served very well. However, Block is apparently unaware that there 
is a certain type of item for which there are serious asymmetries in 
desirability scale values when considering true as opposed to false 
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responses. Consider the item “I do not worry about catching a 
disease," which appears on the Block ER-S scale. Presumably, it is 
somewhat pathological to be obsessed with catching a disease. But 
a true response to the denial of worry about disease implies only 
lack of fairly atypical pathology. This is not particularly desirable 
in itself; the desirability scale value is well within the neutral 
range. But what of a false response? This would indicate that the 
respondent was worried about catching a disease, implying un- 
desirable pathology. When Block's twenty true keyed ER-S items 
are reviewed, sixteen of them are found to be of this type, in- 
volving in one way or another a negation of an undesirable trait. 
To evaluate the possible effects of this sort of asymmetry on the 
Block ER-S scale, the reviewer has rescaled the items with respect 
to the desirability of a false response. It turns out that when re- 
evaluated in this way, Block's true keyed items, far from being 
neutral in desirability, possess desirability scale values well within 
the desirable range. Sixteen of the twenty true keyed items yielded 
scale values in the undesirable direction when scaled for the de- 
sirability of a false response. These scale values, when properly re- 
flected to make them comparable to the more usual values, indicate 
that the Block “subtle SD scale” is keyed predominantly for de- 
sirable responding. While Block’s false-keyed ER-S items do not 
show this sort of asymmetry in scale values for the desirability of 
a false response among themselves, they are discrepant in relation 
to the true keyed items. 

Messick (1965) has uncovered a further uncontrolled source of 
variance in the ER-S scale. Noting that the stylistic tendency to 
respond deviantly is highly correlated with both desirability level 
and with the first factor of the MMPI, Messick found that 70 
per cent of the items were keyed in the more frequent direction ac- 
cording to college norms and that 80 per cent of the items were 
keyed either for desirability or frequency. Such a discrepancy be- 
tween frequency of endorsement and of desirability level is char- 
acteristic of items correlating highly with the Lie Scale, rather 
than with the MMPI first factor, a possibility also strongly sug- 
gested by an examination of the false-keyed ER-S items. Indeed, 
as Messick has observed, a study by Edwards and Walsh has indi- 
cated that ER-S loads a factor marked by the MMPI Lie scale 
more highly than ER-S loads the “Alpha” or the desirability fac- 
tor. It is likely that Block’s failure to rotate his axes resulted in a 
confounding of the Lie Scale factor with his Alpha factor, result- 
ing in an erroneous conclusion about the nature of the ER-S scale. 

In summary, it is clear that Block’s “desirability-free measure 
of ‘social desirability’ ” is far from free of desirability, that true 
and false keyed items fail to load consistently the same desirability 
factor, and that logically there are scant grounds for concluding 
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that the existence of a "subtle SD scale" would embarrass the de- 
sirability conception. Indeed, the extreme content heterogeneity of 
these items would subvert any cogent content interpretation, such 
as that attempted by Block. Hence, the rationale for constructing 
the ER-S scale can only be considered naive and the conclusions 
derived therefrom considerably wide of the mark. 

In Chapter 8, Block attempts to show that there are correlates 
of the first two factors of the MMPI in Q-sort ratings of others, 
implying that such findings are somehow antithetical to a response 
style hypothesis. While there is a bewildering diversity of Q-sort 
statements linked to the first two factors, without a high degree of 
consistency from one sample to another, let us assume that Block 
has made his point, namely that there are such correlates. What 
inferences can one draw from this? Block’s logic (fallaciously) 
reasons that a response style cannot possess valid variance. More 
than fifteen years ago Cronbach explicitly recognized the alterna- 
tive. More recently, Jackson and Messick (1958) devoted an en- 
tire Psychological Bulletin article to the hypothesis that response 
styles, such as tendencies to respond desirably and to acquiesce, 
rather than being necessarily sources of cumulative error, may be
sources of valid variance. Characteristically, Block avoids refer-
ence to this hypothesis in his attempt to “disprove” the response
set interpretation of the MMPI by seeking to demonstrate con-
vergent validity for the two large MMPI factors. His findings here
might well be taken as support for the stylistic interpretation of 
desirability and acquiescence. 

The final chapter, entitled, “Implications for Research on Struc- 
tured Personality Inventories,” is a potpourri of ad hominem 
argument, together with a curious juxtaposition of defense of the 
MMPI and suggestions about constructing better inventories to 
replace it. Throughout the book, and particularly in the last chap- 
ter, Block makes rather snide reference to psychometricians who 
are naive clinically. He states (p. 117) that “some unfortunate or 
unnecessary analytical and interpretive excursions might have 
been avoided if the appreciable knowledge and lore surrounding 
the MMPI . . . had been known to investigators about to begin a 
correlational study.” Block seems to refer to some “higher knowl- 
edge” unavailable to mere research workers. In appraising such an 
assertion, the reviewer is fortunate in having had some experience 
with the “lore” surrounding the MMPI. He spent a full year work- 
ing as a clinical psychologist in a psychiatric hospital in Min- 
nesota, with considerable experience in administering the MMPI 
clinically, and with frequent opportunity for consultation with 
others immersed in MMPI lore. It was as a result of all too fre- 
quent instances of disconfirmation of clinical impressions—of pa- 
tients showing elevated profiles on psychotic scales while otherwise 
free from such symptoms—that the basis for the response style
hypothesis was developed. While there may well be something to
be learned from clinical experience, in the present instance this 
sort of appeal to clinical lore in what is essentially an attempt at a 
psychometric analysis seems inappropriate, and inappropriately 
directed. In a similar vein, Block suggests that other researchers 
have demonstrated “ignorance” and “naivete” without providing 
pertinent citations to research. Furthermore, Block repeatedly sug- 
gests that research with the MMPI ought to be undertaken on ap- 
propriate deviant populations, wherein the large factors might be 
subdivided into more relevant content categories. However, Block 
fails to base his own major analyses on psychiatric samples, but 
instead uses five non-psychiatric samples. Block also overlooks the 
fact that when Jackson and Messick compared factor patterns for 
two deviant samples with that obtained with a college sample, 
the factors they identified as due to response set proved to be larger 
in the deviant groups. 

It is anomalous that a book designed to defend the approach 
used in constructing the MMPI should end with suggestions on the 
construction of new personality scales. Those familiar with test
theory and the personality assessment literature will find little 
that is new here. Much of what Block suggests might stem equally 
well from a response set position. The response set position would 
maintain that the first factor is much too large. Block agrees. Its 
pervasiveness should be reduced. Block agrees. Some non-obvious, 
socially neutral items should be used. Block agrees. Item pools 
should be broadened to include dimensions other than the two
associated with the MMPI. Block agrees. More at-
tention should be given to internal consistency. Block agrees. The
reliability of differences between scales indicating similar kinds of 
psychopathology is a matter of concern. Block agrees. Perhaps the 
only important respect in which Block’s position departs from that
of the reviewer is in Block’s faith in empirical item selection
against external criteria. Loevinger’s incisive critique of this ap-
proach, coupled with the sad experience of the MMPI’s failure
to provide evidence of discriminant validity, has convinced the
reviewer of the bankruptcy of this approach as it is usually ap-
plied. Mechanical applications of empirical item selection in an at-
tempt to develop construct measures of general import have not
met with success.

Thus Block, in spite of heroically futile and frequently autistic
efforts to defend the MMPI and to attempt to fault others for iden-
tifying MMPI response consistency as response style, eventually
comes around to a position which places him squarely in the camp
of the MMPI critics. When he has shifted roles from MMPI de-
fender to that of prognosticator about future personality inven-
tories, he recognizes the need for fresh conceptions regarding item
and scale composition. Whether one prefers the Block interpre- 
tation of the evidence that the MMPI item pool is heavily satu- 
rated with two extremely general dimensions of “ego resilience” and 
“ego control,” or the interpretation that these dimensions are pri- 
marily stylistic in nature, while demonstrating some convergent 
validity, the implications are quite similar. It follows that for the 
critical tasks of differential diagnosis, and of assigning respondents 
to multiple latent dimensions of personality, an item pool such as 
that of the MMPI is both inefficient and unequal to the task. One 
only hopes that future investigators who combine sophistication in 
measurement with appreciation for the complexity of personality 
will not divert their energy in abortive attempts to defend the es- 
sentially anachronistic MMPI but will expend it in the more 
fruitful and challenging task of innovative research in personality 
assessment.

Project Talent: One-Year Follow-up Studies by J. C. Flanagan,
W. W. Cooley, Susan J. Becker, Janet Combs, R. Holdeman, 
P. R. Lohnes, and L. F. Schoenfeldt. Pittsburgh: University of 
Pittsburgh, 1966. Pp. xix + 250 + 100 page Appendix. 


Project Talent began auspiciously in 1957 with the award of the
largest research grant given then or since by the U. S. Office of
Education's Cooperative Research Program. With advisory panels
that read like a Who's Who of American psychology and educa-
tion, plans were laid for two days of testing in a representative
sample of approximately 1000 U. S. high schools. The testing was
carried out in 1960 when over 2000 test and questionnaire items
were administered to 400,000 ninth through twelfth grade students.
Long range plans call for follow-ups one, five, ten and twenty
years after the expected date of high school graduation. 

The prodigious first reports of results of this massive assessment 
were aptly characterized by David Tiedeman in a Contemporary 
Psychology review (August, 1964) as attempting simultaneously 
to be both a census and research and consequently doing an ade- 
quate job of neither. Since these early reports, there has been a 
changing of the guard, and of the old crew only Marion Shaycoft 
and the responsible investigator, John Flanagan, remain. With the 
exception of Flanagan all of the authors of the present report 
joined Project Talent after Cooley was appointed Project Director 
in 1964. Thus, this report is of interest not only as the first report 
of follow-up data, but also as an indication of the steadiness of the 
new hands on the helm of this important project. 

While superficially similar to the previous monographs in its 
copious tables and charts, this volume is written at a higher level 
of statistical abstraction and demonstrates a new degree of sta- 
tistical prowess. There is also a verve and goal directedness, par- 
ticularly in the chapters by Cooley, that gives a new sense of 
progress toward more clearly defined objectives. Missing is the 
uninterpreted multiple-page correlation matrix, which had become 
as much a cachet as the word, Talent, written in all capital letters. 
Well over a hundred thousand correlations were computed in the 
course of the analyses presented in this report, but they are no 
more apparent to the reader than the dots making up a half-tone
photograph of Brigitte Bardot.

The conflict between census and research was decided unequiv- 
ocally in favor of research. The census-type tabulations that are 
presented are explicitly stated to be “only a by-product of the 
research," which consists primarily of the use of multiple discrim- 
inant analysis to study the relationship between the extensive data 
obtained in high school and the educational and vocational cri- 
teria from the one-year follow-up. 

The numerous discriminant analyses employing multi-categoried
criteria, many continuous predictor variables, and adequate sam-
ples give the best indication yet available of what this powerful
multivariate technique will and will not do. In this monograph
discriminant analysis yields three main types of information: (a)
It gives an indication of the relative ability of various types of
predictors to discriminate among criterion groups. For example,
groups of students with different career plans at the time of
follow-up were best discriminated by the ability measures, next by the
interest measures and least by the temperament measures, but the
best predictor of all was the initial career choice. (b) It gives a
measure of the similarity of the various criterion groups in terms 
of the predictor variables. For example, “the students attending 
George Washington University were most like the students at- 
tending the University of Pittsburgh in ability, but the George 
Washington students more closely resembled the students at Cor- 
nell and Temple in interests, and the students at Northwestern 
and Washington Universities in temperament.” (c) It gives an 
indication of the number of independent dimensions along which 
criterion groups can be discriminated. For example, two dimen- 
sions accounted for between 80 and 90 per cent of the discrimi- 
nable variance in most instances. 

What discriminant analysis does not do is aid in the understand- 
ing of the relationships that it so precisely summarizes. The crite- 
rion groups can be plotted on orthogonal axes representing the 
major discriminant functions; but, when one attempts to under- 
stand the meaning of these deceptively simple scatter plots, he 
finds that the axes are conceptually-meaningless statistical arti- 
facts. Those who have chafed at the ambiguity and arbitrariness 
of the results of factor analyses will go mad trying to read mean- 
ing into the discriminant functions, which are rotated not to sim- 
ple structure, not to psychological meaningfulness, but to maximally 
discriminate the groups. The reader who suffers from a growing 
sense of inferiority at his inability to comprehend the meaning 
of discriminant functions described in such glowing terms by the 
authors (e.g. “These results are not only interesting, but extremely 
valuable.”) will be gratified by the tacit admission in the last 
chapter that there is also a hunger for meaningfulness at Project
Talent. There the authors announce that for future analyses they 
are reducing their data to factor scores on a set of 13 ability 
factors and 13 motive factors with the explanation: “We think 
that prediction studies from the base of these factors will be much 
easier to interpret than studies based on 98 highly redundant 
scales.” Let us pray that they are right. 

The monograph ends with Project Talent’s Mein Kampf: A plan 
for replacing the “unprofessional” current methods of record keep- 
ing and guidance in the schools with a computer measurement 
system (CMS). Aside from improving efficiency by automating 
school record keeping, the CMS boils down to a scheme whereby 
students will take Project Talent tests and will receive in return a 
statement of their chances of entering various educational pro-
grams and career fields after high school. Although it is empha-
sized that the CMS gives only probabilities of future events, not
advice or coercion, in practice the students can respond in only
two ways: they can ignore the CMS and go on their way or they
can take the implicit advice and work toward a high-probability
alternative. All previous experience seems to indicate that they 
will do the former, but, while there is still time, perhaps we 
should consider what might result should they do the latter. 

The probabilities reported by the CMS will be based initially 
on the current Project Talent sample, but once the system is opera- 
tional the discriminant equations will be updated by follow-ups of 
succeeding generations of students. Thus, to the extent that stu- 
dents are influenced by the CMS advice they will enter fields 
where, in important respects, they are like others already in the 
field. This will tend to increase predictability on the next round of 
follow-ups, thereby increasing probabilities reported to future stu-
dents, strengthening the implicit advice, and so on for successive
iterations. If the system works at all, it will in time increase its
own effectiveness and converge toward a condition where people 
entering various fields are homogeneous with respect to the vari- 
ables in the discriminant equations. 
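
The feedback loop described above can be illustrated with a toy simulation. It is only a sketch under assumptions that do not come from the monograph: a single standardized predictor score, two career fields, and a hypothetical rule that the share of students who follow the implicit advice equals the standardized separation between the two fields at the last follow-up (capped at 1). All names and parameter values are invented for illustration.

```python
import random
import statistics


def simulate_cms(rounds=6, n=2000, compliance=0.2, seed=0):
    """Toy model of the CMS feedback loop (all parameters are hypothetical).

    Returns the average within-field standard deviation of the predictor
    score after each round of advice and follow-up.
    """
    rng = random.Random(seed)
    boundary = 0.0  # score above which the system "advises" field 1
    spreads = []
    for _ in range(rounds):
        fields = ([], [])
        for _ in range(n):
            score = rng.gauss(0.0, 1.0)
            if rng.random() < compliance:
                choice = 1 if score > boundary else 0  # follows the advice
            else:
                choice = rng.randrange(2)  # ignores the advice
            fields[choice].append(score)
        # Follow-up: re-estimate the field means and the within-field spread.
        means = [statistics.fmean(f) for f in fields]
        spread = statistics.fmean([statistics.pstdev(f) for f in fields])
        spreads.append(spread)
        boundary = (means[0] + means[1]) / 2
        # Assumed rule: better separation -> stronger advice -> more compliance.
        compliance = min(1.0, abs(means[1] - means[0]) / spread)
    return spreads


if __name__ == "__main__":
    for round_number, spread in enumerate(simulate_cms(), start=1):
        print(round_number, round(spread, 2))
```

Under these assumptions the within-field spread shrinks from round to round, which is the convergence toward homogeneity that the argument anticipates; a different compliance rule could of course behave differently.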

Increased homogeneity of occupational groups in terms of such 
performance relevant variables as ability or ambition might well 
slow down the rate of adaptive innovation of occupational skills. 
Even more insidiously, increased homogeneity in terms of vari-
ables that are unrelated to performance would tend to increase
existing social inequality. For example, girls in the highest ability 
quarter were reported to have a probability of entering college of 
.82 if they are from a high socioeconomic level and of .34 if they 
are from a low socioeconomic level. What would be the result of 
giving these probabilities, backed by the authority of the CMS, to 
students who are in the process of deciding whether or not to enter 
college? 

The characteristics of the people in various fields are now deter- 
mined by an intricate system of incentives and social gates. The 
CMS would attempt to predict the action of this ever-evolving 
system and to usurp its function by steering students into the paths 
that offered the least resistance to similar students of the last 
generation. Thus, the CMS is inherently conservative and unre- 
sponsive to changing social needs. 

There are advantages and accompanying dangers of a central- 
ized advisory system through which social planners might influ- 
ence the manpower supply in various fields to meet anticipated 
needs, but the CMS as currently envisaged is backward looking 
and incorporates no such foresight. It would, for example, dispas- 
sionately discourage able students from becoming teachers because 
the typical student recruited into education in the past has been 
at the low end of the college ability range. In the face of such 
patently insalubrious output, the temptation to tamper with the
latent vectors would be great.

This monograph lacks polish and is incomplete in many ways
as is to be expected of a progress report of an ongoing project, but 
it should whet the reader’s appetite for future reports, which are 
promised on a seemingly impossible time schedule. These future 
reports will bear watching because, with the new level of techni- 
cal sophistication and the new elan at Project Talent, it is not un- 
likely that they will achieve at least some of their grandiose aims. 


ROBERT C. NICHOLS
National Merit Scholarship Corporation 


Basic Concepts of Measurement by Brian Ellis. Cambridge, Eng- 
land: Cambridge University Press, 1966. Pp. 220. $8.50. 


Serious consideration of the nature of measurement probably 
began (somewhat more directly than usual) with Plato and his 
theory of forms. It received its greatest impetus from nineteenth 
century physics, but got no systematic treatment other than 
Mach’s, which was confined to the area of temperature measure- 
ment, before the works of Campbell and Bridgman early in this 
century. No important attempts at generalization to the social and 
behavioral sciences except Stevens’ work and the contributions of 
Coombs and Luce have been advanced. Ellis, who is Senior Lec- 
turer in History and Philosophy of Science at the University of 
Melbourne, undertakes a comprehensive treatment of the philosoph- 
ical nature and status of the entire concept of measurement. 
He speaks from a position that is empiricist, positivist, and for- 
malist, perhaps in that order. He holds that relatively little re- 
cent work has been done, not because the various older theories 
are self-evident and satisfactory, but because no serious thought 
has been devoted to the problem. Ellis' own approach is not his- 
torical, although his opening discussion bows briefly toward Plato, 
and a section of Mach’s writings on temperature theory is in-
cluded in an appendix. Ellis proceeds in a more building-block
fashion which avoids tiresome historical blind alleys, and pro-
vides a clear account of the issues now at stake.

The author begins with the distinction between pure and ap- 
plied arithmetic. He reaffirms the distinction, tending to cast pure 
arithmetic into a very formalist mold and to subdivide applica-
tions of arithmetic into those which are involved in setting up
scales and those which follow applications of the scales to empiri-
cal problems. More technically, the former usually yield analytic
propositions from the meaning of unit names, while the latter
generally produce synthetic propositions.

Ellis then proceeds to examine the concept of “quantity.” He 
concludes that there are no “absolute” quantities (e.g., absolute 
mass), but that all quantities have a basically relational charac-
ter and are logically dependent upon a set of linear ordering rela-
tionships. Given a quantity, a scale of measurement for the quan-
tity can be constructed. But legitimate scales can be constructed
that differ in a variety of ways. The two major procedures for
classifying scales can be traced to Campbell and to Stevens.
Campbell’s system depends on the kinds of measuring procedures
employed and Stevens’ on the mathematical properties of the 
scales. No matter how they are classified, it is a metaphysical er- 
ror to speak of “true” scales. Ellis feels that Campbell’s approach 
produces more powerful insights into the nature of measurement, 
but that Stevens’ work may be of more practical value in such 


Derived measurement is performed on quantities via numerical
laws that relate the results of measurements of two or more quan- 
tities. An example of such a quantity is pressure. Once more, Ellis 
feels that conventions involving derived measurement are largely 
arbitrary whereas the real criterion should be formal simplicity. 
He demonstrates that the “universal constants” that occur in 

As something of a tour de force, the author closes with an
analysis of two special quantitative concepts: number and prob- 
ability. He demonstrates that number is a very special concept in 
measurement which exists on the basis of necessary ordering rela- 
tionships. Its measuring operation is counting, which seems to lack 
completely the arbitrariness of most measuring procedures. Ellis
formalizes and reaffirms the concepts of logical and empirical prob-
ability and shows that the latter is essentially a quantitative con-
cept and shares the attributes of such concepts.

Although this book concentrates on philosophical positions and 
issues, the ideas it develops are of sufficient power and importance 
that anyone interested in measurement should have an introduc- 
tion to them. Fortunately, Ellis’ style makes this book much easier 
to read than most philosophical works. In particular, the empiri-
cist flavor he imparts and his excellent and imaginative use of ex-
amples—of which there are fewer from the behavioral sciences
than one might wish—keep his arguments coherent and meaning- 
ful. The text is liberally weighted with mathematico-logical defini- 
tions and statements of relation, but these are largely at the level 
George Miller calls “discursive” and will hamper the entree of
few interested persons. The book is generally well organized and 
well written, and will certainly provide a take-off point for dis-
cussions of the nature of measurement for many years to come.


JAMES A. WALSH
Iowa State University 


Nursing Evaluation: The Problem and the Process, The Critical- 
Incident Technique by Grace Fivars and Doris Gosnell. New 
York: Macmillan Co., 1966. Pp. xii + 228. $5.95.


Once in a while the research worker in measurement and evalua- 
tion comes across most unexpectedly a golden nugget in an area 
where he may have expected not to find one. Such is the case in 
the instance of the noteworthy little book prepared by Grace 
Fivars and Doris Gosnell who have done for nursing evaluation 
what only a handful of other specialists in education and psychol- 
ogy have achieved during the past three decades in the mainstream 
of curriculum in elementary, secondary, and higher education. 

It may well be that the unique difficulties that are faced in the 
evaluation of nursing success required the extra effort on the part 
of the two able authors of this book—effort which has paid off in a 
handsome way. In slightly more than two hundred pages this 
highly substantive volume covers more relevant and helpful mate- 
rial in the implementation of objectives of instruction and learning 
than do most books in educational and psychological measure- 
ment and evaluation that are two to three times as long. Despite 
the somewhat specialized emphasis in the use of critical incident
in nursing evaluation, the transfer value of this volume is exceed-
ingly high, as substitution of another word wherever the word
“nursing” appears would make most of the contents applicable not
only to the programs of other professional disciplines, but also to
most programs in higher education, to selection and training enter-
prises in the civil and military services, and to the in-service train-
ing programs of business and industry.

Although three of the first four chapters are primarily concerned 
with the critical incident as an approach to problems in nursing,
with critical requirements as an approach to establishing objec-
tives in a school of nursing, and with the critical incident as a ba-


and Tyler is particularly apparent. Emphasis upon the evaluation
process in terms of the task is developed in Chapter 7, in which 
the imprint of Gagné is clearly apparent. In Chapter 8, the familiar
tools of evaluation—paper-and-pencil tests, questionnaires, and 
interviews along with observational techniques—are described, 
and appropriate sections on interpreting results of evaluation and 
on communicating their significance to students are included. Con-
cerned with evaluation in terms of performance, the ninth chapter
clearly sets forth detailed illustrations of performance records 
along with numerous examples of how to record a critical incident. 
Included in the same chapter are twelve behavior areas which are


“Critical Requirements for Selected
Fields,” “Student Behavior in Selected Clinical Areas,”
and “Performance Descriptions” offer valuable aid to the research 
workers in medical and nursing education who wish to set up evalu- 
ation programs. Throughout the text, one finds at the close of 
each chapter highly useful bibliographic entries which one may 
consult.

In summary, this book affords a streamlined version of the 
evaluation process that is of enormous value to both research work- 
ers and practitioners who are concerned with the development of 
evaluation procedures for school programs. It also contains mate-
rial of considerable significance to the classroom teacher who is
interested in improving evaluation processes. This volume should 
serve as a landmark for many years to come.

WILLIAM B. MICHAEL
University of California, Santa Barbara 


Better Classroom Testing by Frank F. Gorow. San Francisco:
Chandler Publishing Co., 1966. Pp. viii + 99.

This small book is designed to aid the teacher in constructing
tests that will achieve a given purpose more precisely, and
measure more than “knowledge alone.” The author hopes to
achieve this end by promoting test analysis, and by helping the
reader “capture some of the expertness of those who . . . in
test construction.”


The book is very simply written with a minimum of technical 
language. This is likely the reason for the repeated lack of preci-
sion which characterizes many statements, especially of a defini-
tional nature. The following examples are 
room test measures what has been taught (or should have 
taught).” Construct validity “refers to the kinds of learning speci-
fied or implied in the objectives.” With a reliable test “each stu-
dent will maintain about the same relative rank in his group (or
achieve about the same test score) each time he takes the test.”



The author's attempt to get test makers . . . tests to more
than a . . . His efforts also to encourage test makers to include more than a single
kind of achievement—factual knowledge—in their tests is com-
mendable. However, the author's categories of knowledge, com-
prehension of concepts and meanings, comprehension of complex
ideas, and application or transfer appear clearly inexhaustive in
the light of more complete classifications of educational objectives.
This reader feels that achievement is best classified under a more
complete system, such as Bloom's Taxonomy of Educational Ob-
jectives.
With the descriptions of how to construct various kinds of items,
the author . . . excellent one; his mode of execution, however, is less than excel-
lent.


CLINTON I. CHASE
Indiana University 


Mathematical Explorations in Behavioral Science by Fred Mas- 
sarik and Philburn Ratoosh (Editors). Homewood, Illinois: 
Richard D. Irwin, Inc., 1965. Pp. viii + 387. $7.95.


This volume is an outgrowth of an inter-disciplinary symposium 
held in 1961. The invited emphasis was on conceptual innovation 
supported by data, and the contents (indeed, the format and tone) 
of the book faithfully reflect this emphasis. The editors have
purposefully included the term Explorations in the title, reflect- 
ing their belief that “... many of the exciting and promising ap- 
proaches reported still are highly tentative. . . .” An auxiliary pur- 
pose of the volume was to emphasize the importance of a bridge 
between the abstract mathematical model and the empirical data 
“. .. needed to breathe life into the model.” The emphasis of the 
symposium resulted in the inclusion of a great variety of papers 
(from both substantive and analytic points of view), and this was 
intended as a useful feature, illustrating the considerable differ- 
ences in mathematical strategy and empirical completeness char- 
acteristic of the various fields. Almost all of the papers are a part 
of ongoing research programs at various levels of sophistication. 
As expected with such a publication, many (about half) of the 
chapters are published elsewhere. 

In addition to the introductory chapters, there are four major 
sections which include substantive interests in the general aca- 
demic fields of Psychology (Chapters 8-14), Sociology (Chapters 
15-21), and Economics (Chapter 22), as well as a section dealing 
with philosophy and method in the behavioral sciences (Chapters 
4-7). The following comments summarize briefly each chapter and 
its contribution.

In Chapter 4, Churchman proposes a “World Information Cen- 
ter” with the purpose of collecting, cataloguing and publishing con- 
clusions concerning every “significant research document”. Such a 
monumental task he considers a part of mathematics, broadly de- 
fined. 

Hunt (Chapter 5) proposes a technique of multi-model evalua-
tion based on procedures of Bayesian statistical inference. The
final credibility of a given model is an estimate of relative pref- 
erence for the model, given the data of a series of experiments. The 
author deals with the question of assignment of a priori credibili- 
ties, as well as with that of possible decision rules to determine 
when to stop experimentation and to accept one model as “the best” 
among those being evaluated. Examples of the application of the
evaluation technique were chosen to exhibit the advantages and 
pitfalls of the procedure. 
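
The kind of procedure summarized above can be sketched in a few lines. This is a generic Bayesian updating sketch, not Hunt's own formulation: it assumes that each experiment supplies the likelihood of its data under every competing model, and the function names and the 0.95 stopping threshold are hypothetical.

```python
def update_credibilities(credibilities, likelihoods):
    """One Bayesian step: posterior credibility is proportional to
    prior credibility times the likelihood of the new data."""
    weighted = [c * l for c, l in zip(credibilities, likelihoods)]
    total = sum(weighted)
    if total == 0:
        raise ValueError("the data are impossible under every model")
    return [w / total for w in weighted]


def evaluate_models(priors, experiments, stop_at=0.95):
    """Update after each experiment and stop as soon as one model's
    credibility reaches the (assumed) threshold stop_at.

    Returns the final credibilities and the number of experiments used.
    """
    credibilities = list(priors)
    for count, likelihoods in enumerate(experiments, start=1):
        credibilities = update_credibilities(credibilities, likelihoods)
        if max(credibilities) >= stop_at:
            return credibilities, count
    return credibilities, len(experiments)


if __name__ == "__main__":
    # Two models with equal a priori credibility; every experiment yields
    # data four times as likely under the first model as under the second.
    experiments = [[0.8, 0.2]] * 4
    print(evaluate_models([0.5, 0.5], experiments))
```

In this invented example the first model's credibility passes 0.95 after the third experiment, so the fourth is never needed. The choice of priors and of the threshold are exactly the points the chapter is said to discuss.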

Harrah (Chapter 6) develops a highly complex logical model to 
deal with the question of how the rational receiver behaves in cer- 
tain communication situations, and comments on potential empiri- 
cal approaches to the problem. The many definitions and theorems 
introduced make this chapter quite difficult reading. 

In his non-mathematical paper (Chapter 7) Scheff discusses the 
medical norm, “when in doubt, diagnose illness”, as an instance of 
a decision rule for guiding behavior under conditions of uncer- 
tainty. He concludes that the usual scientific aversion for Type I 
as opposed to Type II errors is inappropriate: in many cases, diag- 
nosing illness when there is none may do great harm. He proposes 
also considering Type II error in medical diagnostics by comput- 
ing the “expected value” of a treatment. There are, of course, prob- 
lems of probability and utility estimation, and the problem of com- 
paring incomparables on a single continuum (“How does one 
weigh the risk of death against the monetary cost of treatment?”). 
This paper represents a refreshing theoretical application of a 
principle of statistical theory to problems of practical conse-
quence.
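
The “expected value” comparison mentioned above can be sketched as follows. The utilities are invented for illustration and are not taken from Scheff's paper; they stand in for exactly the estimates that, as noted, are hard to obtain and hard to place on a single continuum.

```python
def expected_value(p_ill, utilities):
    """Expected utility of an action, given the probability of illness.

    utilities maps the patient's true state ("ill" or "well") to the
    utility of taking the action in that state.
    """
    return p_ill * utilities["ill"] + (1.0 - p_ill) * utilities["well"]


def choose_action(p_ill, treat, withhold):
    """Pick the action with the larger expected utility."""
    if expected_value(p_ill, treat) > expected_value(p_ill, withhold):
        return "treat"
    return "withhold"


if __name__ == "__main__":
    # Hypothetical utilities: treating a well patient does harm (-4),
    # failing to treat an ill patient does greater harm (-10).
    treat = {"ill": 10.0, "well": -4.0}
    withhold = {"ill": -10.0, "well": 0.0}
    for p in (0.1, 0.5):
        print(p, choose_action(p, treat, withhold))
```

With these made-up numbers the rule “when in doubt, diagnose illness” is not optimal at a low probability of illness: treatment is preferred only when that probability exceeds 1/6.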

Attneave’s chapter (Chapter 8) is a brief non-mathematical sum- 
mary of some ongoing research and of preliminary results regarding 
pattern perception in sequential stimulation. 

In his paper (Chapter 9) Miner defines conformity as adher- 
ence to a group norm (determined statistically) in the Tompkins- 
Horn PAT. This definition is considered more representative of 
conformity in everyday life than that defined by the Asch situa- 
tion. The procedure is considered as a possible operational defini- 
tion for a group, and evidence is discussed for relationships be- 
tween conformity and task performance, intelligence, and age. It 
is worth noting the author’s comment that at present, we have “no 
adequate basis for differentiating the important dimensions of 
conformity.” 

Colby’s very speculative paper (Chapter 10) has as its stated 
purpose the comparison of stochastic and computer models in ex- 
perimental psychotherapy research. It is very brief and does not 
seem to accomplish the stated purpose well. 

In Chapter 11, Lieberman investigates the descriptive value of 
mathematical game theory. In general, he concludes that only 
when the interaction situation is simple enough for mathematical 
theories to prescribe precisely a form of rational behavior, e.g. the 
minimax prescription in games with a saddle point, do individuals 
tend to behave in the prescribed way. In game situations in which 
bargaining can occur, other considerations, e.g. equity, seem to be 
quite important. The general gaming approach is concluded (quite
correctly) to offer an abstraction of important elements of social
conflict which may be quite useful as a general experimental
paradigm.
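
The minimax prescription for games with a saddle point can be made concrete with a small sketch. It assumes a two-person zero-sum game whose payoff matrix gives the row player's gain; the matrices and the function name are hypothetical illustrations, not material from Lieberman's chapter.

```python
def saddle_points(payoff):
    """Return (row, col, value) for each pure-strategy saddle point.

    payoff[i][j] is the row player's gain in a zero-sum game. An entry is a
    saddle point when it is the minimum of its row and the maximum of its
    column, so neither player gains by deviating unilaterally.
    """
    row_mins = [min(row) for row in payoff]
    col_maxs = [max(col) for col in zip(*payoff)]
    return [
        (i, j, value)
        for i, row in enumerate(payoff)
        for j, value in enumerate(row)
        if value == row_mins[i] and value == col_maxs[j]
    ]


# Hypothetical game with a saddle point at row 2, column 1.
GAME_WITH_SADDLE = [
    [3, 1, 4],
    [2, 0, 1],
    [5, 2, 6],
]

# Matching pennies has no saddle point in pure strategies.
MATCHING_PENNIES = [
    [1, -1],
    [-1, 1],
]

if __name__ == "__main__":
    print(saddle_points(GAME_WITH_SADDLE))
    print(saddle_points(MATCHING_PENNIES))
```

In the first game the prescription is unambiguous: the row player chooses row 2 and the column player column 1. In the second no pure strategy is prescribed, which is the kind of situation in which, according to the review, observed behavior departs from the theory.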

In Chapter 12, Bartos briefly develops and applies a stochastic 
model to the negotiation process. Part of the model is formally 
identical with Bush and Mosteller’s stochastic model of learning. 
There are, however, a number of simplifying assumptions and
rather arbitrary conventions, and as a result, the discussion is 
quite tentative. Comparison of the model’s predictions with those 
of three other models provides some basis for evaluation. The re- 
sults, however, are inconclusive, and they indicate a need for more 
analytic development as well as more refined experimentation. 

Guetzkow (Chapter 13) describes his recent work in simulation 
of international relations. The simulation is a set of programmed 
assumptions regarding the basic operation of “core variables”, and 
a multitude of free variables which are allowed to develop through 
the interaction of the small groups of individuals which repre- 
sent nations. Many of the difficulties inherent in such an approach 
are recognized, and the importance of work on the validity of the 
simulation is regarded as imperative. 

Solomon’s paper (Chapter 14) presents a probability model 
for group performance in free recall of verbal material, which de- 
pends only on the group members’ individual scores. The estima- 
tion of two individual parameters from the data enables estima- 
tion of group parameters, and thus allows a test of the model. The 
author concludes with the wise call for analytic work on the char- 
acteristics of the relevant sampling distributions, as well as for a 
general revision of the model, before any new experimentation be- 
gins. 

In Chapter 15 Coleman presents some interesting analytical work 
on the process of diffusion in incomplete social structures. He 
writes differential equations of both a deterministic and stochas- 
tic variety, and investigates some points which allow certain in- 
teresting qualitative deductions. The models, however, are only a 
start on the general problem. 

Chapter 16, by Meier, is a loose study of organizational functioning (a
stock exchange and a library) with specific reference to communica- 
tion problems and to their relation to efficient and proper function- 
ing and development. The paper seems considerably too long, liter- 
ary, and redundant. Perhaps the orientation of the discussion in 
the author’s postscript would have better served as the main por- 
tion of his paper. In his discussion of Meier’s paper, Churchill 
(Chapter 17) handles the problems and suggested solutions in a 
much more succinct, direct, and readable manner. 

In Chapter 18, Catton investigates an analogy to mass in a so-
ciological version of gravitation. Catton studies three models and
their relative fit to data obtained from National Park visitation 
rates. Embedded in the paper is a discussion of the explanatory 
value (power?) of the “physicalistic” as opposed to the “ver-
stehende” approach to sociology. A wedding is proposed. A brief
reading makes clear a point that Huff (Chapter 19) states in his 
critique: rigid definitional analogy between the physical and social 
sciences is not likely to be fruitful. Huff’s proffered probabilistic 
gravity model (employing the notion of utility as well as probabil- 
ity) appears more promising. 

Mandelbrot (Chapter 20) considers the empirical distribution 
of city sizes, and proposes a stochastic model to account for his 
data. The extent to which such a model describes population 
change is seen to be remarkably accurate in cases where the dis- 
tribution of the populations considered is markedly non-normal. 

Lazarsfeld and Henry’s chapter (Chapter 21) is probably not 
the best introduction to Latent Structure Analysis, but it does 
present the basic skeleton of the model from both a historical and
a mathematical viewpoint. Two discrete class models are pre- 
sented (an appendix shows the solution of the Latent Profile 
Model) as well as an application to data. It seems a bit strange, 
however, that this symposium includes Lazarsfeld’s approach but 
omits more recently developed areas, e.g. Coombsian nonmetric 
scaling. 

Arrow’s chapter seems fairly typical of many in mathematical 
economics. It is a mathematical development of the hypothesis 
that technical change in general can be ascribed to experience, 
and is based on a great number of assumptions regarding pro- 
duction processes, capital goods, and the like. Its complexity makes 
it difficult reading; however, it is recommended to the interested 
and competent reader. 

There are several general comments regarding this volume which 
should be noted. The first has to do with the introductory section. 
Chapters 1 and 3 present a quick view of the purpose of the book 
as well as cursory summaries of the contents of the remainder of 
the chapters. In Chapter 2, one of the editors (F.M.) reviews 
and evaluates various subcultures of mathematics in the behav-
ioral sciences. His interdisciplinary flair is also evident in the
poetic introductions to each of the five major parts. Though the 
poetry does not merit acclaim, it does, perhaps, aid in communi- 
cating the interdisciplinary nature of the book, and serve to reas- 
sure the reader of the important fact that mathematics need not be 
foreboding and sterile to empirically oriented researchers. In gen-
eral, however, the poetic introductions, as well as Chapter 2, are
poor attempts at spicing up the mathematical approach to the be- 
havioral sciences. 

A second point is that the general organization of the book
could be better. There is a conspicuous lack of the format char- 
acteristic of published symposia: there is almost no transcription 
of the discussion, comments, and criticisms generated by the read- 
ing of the respective papers. This is a sad omission from a didactic 
viewpoint, and tends to leave the reader considerably less in- 
volved. It seems strange, also, that neither the biographical nor the 
literary style of writing was employed exclusively. In addition, it 
would seem that the inclusion of well-written summaries for each 
article would increase the quality of the whole. 

The volume is intended as a potential text or supplementary 
reading source in courses on research methods in the behavioral 
sciences. However, the heterogeneity of the papers (which is diffi- 
cult for the reviewer to justify) may restrict its use in this way, 
since the materials and methods of interest in any specific program 
are likely to be fairly circumscribed in nature. In addition, it is the 
reviewer’s distinct impression that, from both substantive and ana- 
lytic points of view, the papers included are less than represen- 
tative of current “mathematical explorations” in the behavioral 
sciences. 

Finally, the lag of four years (1961-1965) between the occur-
rence of the symposium and publication of the proceedings seems
inordinately long. This point is especially important when we note 
that the editors consider the papers to “. . . represent snapshots in 
a rapidly moving picture.” 


Steven P. McNeel
University of California, Santa Barbara 


Fortran Programming of Electronic Computers by Harry P. 
Hartkemeier. Columbus, Ohio: Charles E. Merrill Books, Inc., 
1966. Pp. 200. 


The purpose of this textbook is to provide an introduction to 
Fortran programming and the processing of such programs on the 
IBM 1620 computer. The book is evidently intended to serve as an 
instructional aid in support of classroom lectures, discussion, dem-
onstration, and practice in computer use.

The approach is simplified, with no more than a background in 
high school algebra required for understanding. Discussion of ma- 
chine language is avoided. Each point is clearly explained, so that 
students who feel uncomfortable in dealing with even the more 
simple technical and scientific concepts should be able to study 
the book without undue difficulty. 

The large selection of examples and illustrations will likely be 
useful in helping students grasp the concept of man-machine com- 
munications. There is enough variety of sample problems to fit 
the interest and needs of students with diverse backgrounds. For
example, programs are presented to obtain: a summary of labor 
costs, grade-point averages of students, solutions to quadratic 
equations, and batting and slugging averages of baseball players. 

An interesting feature of this book is that it provides a fairly 
large number of problems on pages which are perforated so that 
they may be torn out by the student and handed in to the instruc- 
tor. The student is presented with objective and essay items on 
these sheets as well as assignments to develop various computer 
programs. Detailed steps in machine operation are also listed on 
perforated pages and may be conveniently removed from the text- 
book for ready reference when the student is working on the com- 
puter. 

This book is somewhat lacking in generality because it is so 
specifically geared to the IBM 1620 computer. As the 1620 becomes 
less used, the textbook will probably be of less value. It should be 
mentioned also that some minor reading difficulty is presented in 
that figures are sometimes rather widely separated from pertinent 
textual material. 

Instructors in introductory Fortran would be well advised to
examine this book for possible classroom use and to compare it
with others which are available. However, there will obviously be
differences of opinion on the value of the book according to the in-
structor’s objectives and the level of sophistication of the computer
available to the students. 


QUENTIN STODOLA AND 
Gary H. CARPENTER 
California State College 
Dominguez Hills 


Annual Review of Psychology (Volume 16, 1965) by Paul Farns-
worth, Olga McNemar, and Quinn McNemar (Editors). Palo
Alto, California: Annual Reviews, Inc., 1965. Pp. x + 571. 
$8.50. 


For the audience of EDUCATIONAL AND PSYCHOLOGICAL MEASURE- 
MENT three contributions in the 1965 edition of the Annual Re- 
view of Psychology would appear to be of particular interest: 
(a) “Human Abilities” by George Ferguson, (b) “Personnel Selec-
tion” by S. Biesheuvel, and (c) “Scaling” by Gösta Ekman and
Lennart Sjöberg. Although thirteen other noteworthy sections are
included, these three will be the only ones considered. 

Covering the period from 1960 to 1964, Ferguson's review of re- 
search on human abilities considers (a) general theories, (b) he- 
redity and human ability, (c) the development of abilities, (d) the 
organization of human abilities, (e) the relationship between
human learning and ability, (f) ethnic comparisons, (g) effects of
aging on abilities, and (h) several miscellaneous topics such as the
neurological bases of intelligence, bilingualism and intelligence,
creativity, and intelligence in relation to season of birth. Drawing
on the contents of 142 articles, Ferguson not only reported factual
outcomes, but also indicated what he believed to be directions
that research was taking. Two highly pertinent concerns were
raised relative to what contribution factor analysis can make to
the development of a comprehensive theory of behavior and rela-
tive to what has been a lack of relating organizational characteris-
tics of abilities to the processes whereby they develop in the child.

Drawing on 150 articles, most of which appeared between 1961
and 1964, the review by Biesheuvel was concerned with (a) selec-
tion and validation models, (b) the criterion problem, (c) per-
sonality measurement, (d) holistic approaches to selection such as
the “in-basket” procedure, situation tests, biographical inven- 
tories, and other measures, (e) specific studies in such distinctive 
areas as engineering apprenticeship, motor vehicle performance, 
service in foreign countries, and military training, (f) personnel 
selection in developing countries, and (g) current status of re- 
search in personnel selection. This last section is of great interest 
in terms of the conclusion reached that progress has been disap- 
pointing during the past several years. There is the suggestion that 
although the most effective results have been obtained from pro- 
grams specifically devised to meet needs of certain situations, the 
pressing problem is to view selection within a broad context of 
procedures of recruitment and counselling, the specification of job 
requirements, the environments of work organization, and the 
characteristics responsible for determining motivation within this 
broad framework of interrelated elements. 

The comprehensive discussion on scaling, which makes reference 
to 124 sources in the literature primarily for the 1961-1963 period, 
is concerned with (a) general methodological problems, (b) Thur- 
stonian methods, (c) direct estimation methods especially within 
the framework of S. S. Stevens’ contributions, (d) comparative 
studies of various scaling procedures, (e) multidimensional scaling
models, and (f) psychophysical law in relation to scaling methods.
Ekman and Sjöberg's contribution constitutes a useful integration
and synthesis of much recent work in an active area of quantitative 
methodology. 

In short, these three contributions furnish an authoritative and 
at the same time heuristic treatment of important problem areas in 
educational and psychological measurement. The writers are to be 
commended for the care which they have taken in selection of sig- 
nificant research studies and for the high level of scholarship 
which they have exhibited in synthesizing substantial bodies of
published research. 


WILLIAM B. MICHAEL
University of California, Santa Barbara


Annual Review of Psychology (Volume 17, 1966) by Paul Farns-
worth, Olga McNemar, and Quinn McNemar (Editors). Palo
Alto, California: Annual Reviews, Inc., 1966. Pp. x + 589. 
$8.50. 


Although there are seventeen separate topical contributions in 
the 1966 edition of the Annual Review of Psychology, only one 
appears to be directly pertinent to the interests of the readers of 
EDUCATIONAL AND PSYCHOLOGICAL MEASUREMENT; namely, the 12-
page section “Statistical Theory” by Rosedith Sitgreaves of
Teachers College, Columbia University. In this highly abbrevi- 
ated review of statistical literature, which covers the period from 
May 1, 1963 to April 30, 1965, the author chose five papers which 
she thought to be representative of current thinking and activity 
in statistics from a list of 204 references. The bibliography has 
been deposited as Document number 8575 with the ADI, Auxiliary 
Publications Project, Photoduplication Service, Library of Con- 
gress, Washington, D. C. 

Of the five references selected, the first two—one by H. Cramer 
on model building with assistance from stochastic processes
and the other by J. W. Pratt, H. Raiffa and R. Schlaifer con- 
cerned with the foundations of decision theory under uncertainty 
—are expository papers that set forth an overview of important 
contributions and thinking in essentially new areas appearing 
since World War II. Sitgreaves’s review of these two papers is clear,
informative, and entertaining. 

The other three papers furnish new statistical techniques for 
specific types of problems. In one paper R. P. Abelson and J. W.
Tukey suggest a way for selecting a single contrast in testing the 
null hypothesis (as in a one-way analysis of variance) of the 
equality of means when one has reason to believe on the basis of a 
limited amount of information that an alternative hypothesis of a 
simple rank order of n population means can be specified along 
some dimension. Of considerable interest to the sociologist and 
psychologist is the paper by L. A. Goodman that is concerned 
with the analysis of three-factor interaction in contingency tables. 
In the last paper to be considered, E. A. Paulson set forth a statis- 
tical procedure for choosing one of a specified number of popula- 
tions with the largest mean. 

The reader may question the advisability of a review author's 
limiting the coverage of a given section to only five articles when 
more than 204 references were isolated in an initial survey of the
literature. Although there is the advantage of a consideration of a 
small number of articles in relatively great depth, there is the ac- 
companying undesirable loss of necessary breadth of coverage for 
which the typical reader of the Annual Review of Psychology may 
be looking. Despite this marked limitation, Sitgreaves has ren- 
dered a highly competent and scholarly treatment of the five 
articles which she chose to review. 


WILLIAM B. MICHAEL
University of California, Santa Barbara 


Patterns of Redundancy by A. C. Staniland. Cambridge, England: 
Cambridge University Press, 1966. Pp. viii + 216. $8.50. 


One idea that has dominated the use of information theory in 
psychology is that in many tasks, human performance appears to 
be intimately tied to the pattern of information, or redundancy, of 
a stimulus situation. Staniland’s book deals with this notion in 
the general areas of human learning, memory, and perception. 

As in other ideas of information theory, the concept of redun- 
dancy is not always a cooperative one for people interested in 
evaluating human performance. According to Staniland, the major 
difficulty seems to be that because psychological experiments 
rarely conform adequately to the informational model adopted by 
communications engineers (the I.R.E. Standards), there is a need 
to re-examine the model assumptions, and if possible, to revise 
them to fit human processes more closely. His monograph of 21 
brief chapters is devoted to this task. In the initial chapters, he 
examines some typical experimental procedures (chiefly those of
P. M. Fitts and G. A. Miller) in which the I.R.E. measure of re-
dundancy fails to capture aspects of stimulus patterns that may be
important for performance. He shows in detail how a redundancy 
measure based upon only nominal uncertainty may lead to an in- 
appropriate assessment of an experiment, and more importantly, 
how alternative measures can be constructed. 
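
A redundancy measure based only on nominal uncertainty can be sketched as follows. The sketch assumes the conventional definition, one minus the ratio of the observed uncertainty to the maximum uncertainty for the same number of alternatives; whether this matches the I.R.E. Standards in every detail is an assumption, and the probabilities are hypothetical.

```python
import math


def entropy_bits(probs):
    """Uncertainty H = sum of -p * log2(p) over the alternatives, in bits."""
    return sum(-p * math.log2(p) for p in probs if p > 0)


def redundancy(probs):
    """Redundancy R = 1 - H / H_max, with H_max = log2(number of alternatives)."""
    n = len(probs)
    if n < 2:
        raise ValueError("redundancy needs at least two alternatives")
    return 1.0 - entropy_bits(probs) / math.log2(n)


if __name__ == "__main__":
    # Four equally likely stimuli carry the maximum uncertainty.
    print(redundancy([0.25, 0.25, 0.25, 0.25]))
    # Unequal probabilities lower the uncertainty and raise the redundancy.
    print(redundancy([0.5, 0.25, 0.125, 0.125]))
```

Such a measure depends only on the probabilities of the alternatives. It is blind to the order of presentation and to any structure a subject may impose on the stimuli, which is the limitation the alternative measures are meant to address.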

The problem of defining and measuring redundancy is not new. 
Readers familiar with Garner's Uncertainty and Structure as Psy- 
chological Concepts, which contains perhaps the clearest discus- 
sion of the problem, will find both old and new proposals in 
Staniland’s book, but they will not find them as easily. Staniland
seems overly concerned to explicate his proposals with appeals
to the work of others. The result is that some of his most useful
ideas are buried in a discussion that, he acknowledges, [...]
to Attneave, Bartlett, Garner, and others.


The assumption behind Staniland's analysis is that an informa-
tional measure ought to be sensitive to the pattern of stimulus
presentation, to schemes of observation, and to strategies of re-
call, all of which may bear only slight relation to the logic an ex-
perimenter uses to select stimulus materials. Although the analysis
deals with specific stimulus patterns, his concern is to discover 
the informational structure that a subject might impose on the 
experiment, rather than an efficient coding system or a purely 
normative description. The stimuli considered usually have a 
multidimensional character. An ensemble of such stimuli has an 
informational structure which can be decomposed in terms of con- 
tingencies and interactions between variables. Starting with this de- 
composition, Staniland constructs alternative measures of redun- 
dancy, which turn out to be generalizations of the I.R.E. measure, 
taking into account variations in the length and symbol content of
a “message.” These generalizations emphasize the pattern of in- 
formation processing, rather than the amount of it, and in later 
chapters these pattern concepts are extended to less conventional 
stimulus situations. His thorough examination of redundancy 
measures in relation to human performance should give the re- 
searcher some useful approaches for the assessment of tasks and of 
individuals. Those interested simply in new interpretations of in- 
formation concepts will not be disappointed. One example is his 
unconventional treatment of interactions, in which he suggests a
role for negative interaction terms, and a performance measure 
that relates interaction to redundancy. 

Staniland writes his own book review in the first chapter, and 
appends tables of log₂ n and −p log₂ p that are helpful in work-
ing out his illustrations. Mathematical and computational details
have been kept to a minimum and usually appear separately at 
the ends of chapters. Some readers would have appreciated a more 
complete mathematical treatment, especially some consideration of 
the statistical sampling properties of the measures. 

Although the book is divided into small chapters, they are not 
all so separately comprehensible as one might wish. The principal 
barrier to quick rapport with the text is the complex subject 
matter, but Staniland adds to this with somewhat unconventional
language (e.g., “determinable” is used instead of “stimulus dimen-
sion” or “variable”). This is especially unfortunate for a mono-
graph that contains as many useful ideas as Staniland’s does.


W. D. LARKIN
University of Illinois 


Learning and Human Abilities, Educational Psychology, by Her- 
bert J. Klausmeier and William Goodwin. New York: Harper 
and Row, Publishers, 1966. Pp. xxi + 720.


This is the second edition of a textbook published in 1961. The 
authors state in the preface that “The main variables affecting
efficiency of learning in the classroom or in other group settings are
treated in detail as are other conditions essential to efficient 
learning." 

The text is divided into an introductory chapter and four major 
parts. Part I is titled “Components of the Teacher-Learning Situa-
tions.” Part II is titled “Achieving Learning Outcomes Efficiently.”
Part III is titled “Desirable Conditions for Learning.” Part IV is 
named, “Evaluation and Transfer.” 

The material is generally presented inductively. Each chapter 
includes a fairly comprehensive introduction and ends with a con- 
cise summary. 

The conceptual rationale for the new edition is stated as follows: 
‘Recently, more attention is being given to improving efficiency of 
learning and related development of human abilities. . . . We have 
pioneered in setting forth a functionalist theory formulating nine 
models of instruction, each dealing with more specific learning 
outcomes and phenomena: factual information, concepts, prob- 
lem solving, creativity, psychomotor abilities and skills, attitudes 
and values, personality integration, motivation, retention and 
transfer.’ These models have not been revised very extensively 
in their organization. However, the up-dating of the reference
materials in these parts was fairly extensive. 

The authors report that they have sought and evaluated the re- 
actions of about 100 instructors who used the first edition. They 
used these opinions to support the emphasis in the second edition 
on teaching and learning and the continued inclusion of a section 
on measurement, statistics, and research design. 

The most notable changes in the content and organization of the 
text reflect the authors’ concern with presenting a teaching oriented 
discussion of human intellect and learning. In Chapter 2 on human 
abilities, the authors introduce Guilford’s model of the structure 
of intellect. Chapter 3 is a new chapter which did not appear in 
the first edition. It presents learning theories with a decided bias 
away from the reductionist position and toward a cognitive theory 
approach. In Chapter 7, “Factual Information and Concepts,”
there is emphasis added on the notions of Piaget, but original
[...] are forsaken for the adaptations by Elkind and Flavell.
In [...] personality integration, a new developmental
[...] the integration of personality traits in childhood and
adolescence is presented. In Chapter 14, on providing for individ-
ual differences, a timely, but inconclusive, section is added on the
[...] associated with teaching culturally disadvantaged chil-
dren.

In the seventeen chapters that were revised, the mean incidence 
of the use of new titles in the list of references at the end of the 
chapter is about fifty per cent. The chapters range from no new
titles in the chapter on statistics and research design to eighty-
five per cent new titles in the chapter on instructional media and
organization. 

Klausmeier and Goodwin have faced a problem that must be 
resolved one way or the other by consumers of psychological re- 
search. The question is, should one accept only that experimental 
evidence based directly on observations of human and infrahuman 
behavior or, on the other hand, should one be concerned with some 
of the more promising theoretical constructs which reflect the prob- 
lems faced by educators in professional situations? In choosing the 
latter position, Klausmeier and Goodwin have in a way for- 
saken certitude for significance. This decision was crucial, since 
the worth of the text to any given instructor depends on the theoret- 
ical position he brings to his class. Learning and Human Abili- 
ties is a text which a behavioristically oriented instructor would 
find uncomfortable to work with, but it does attempt to deal with 
problems which many teachers meet daily; problems a behaviorist 
would feel obliged to ignore. 

In the chapter on statistics and research design, no mention is 
made of the assumptions which must be met by the data, before a 
technique is appropriate. The authors claim, “The purpose of this 
chapter (“Statistical Research and Design”) is to clarify the more 
common statistical terms and symbols frequently encountered in 
research reports so that the reader can interpret such reports mean- 
ingfully.” An important part of interpretation would appear to be 
judging if the statistical technique used was appropriate. If one is 
ignorant of the assumptions of a technique, then meaningful inter- 
pretation would appear to be impossible. 

The new chapter in the second edition titled “Learning Processes 
and Theories” fits the authors’ overall plan well, since it empha- 
sizes the ideas of such theorists as Ausubel, Bandura and Walters, 
and Verplank. However, it might have been appropriate to include 
here a discussion of some of the more extreme behaviorists, even 
if only to refute the suitability of their notions in school set- 
tings. It seems somewhat unrealistic to write a chapter on learn- 
ing theories and to exclude Pavlov, Thorndike, Guthrie, and 
Skinner. 

Learning and Human Abilities reflects an attitude about the 
function of educational psychology in the field of systematic in- 
quiry into the nature of human behavior. It appears that it is the 
authors’ conscious desire to show how psychological prediction and 
hypothesis testing can be applied to the classroom. The book 
serves as a delineator of the liaison function of the educational
psychologist between the laboratory and the classroom. The au-
thors’ bias towards cognitive theories and away from behaviorist
models is perhaps a little extreme, but for college instructors who
espouse the cognitive position, it should be a useful teaching tool. 


Jack L. SLOAN 
Albany Study Center for 
Learning Disabilities 


Clinical and Social Judgment: The Discrimination of Behavioral 
Information by James Bieri, Alvin L. Atkins, Scott Briar, 
Robin Lobeck Leaman, Henry Miller, and Tony Tripodi. New 
York: John Wiley & Sons, Inc., 1966. Pp. xiv + 271.


Rather than broadly surveying concepts and research in the 
areas of clinical and social judgment as the title of this book sug- 
gests, the authors have chosen instead to concentrate on the di- 
verse literatures of information theory, anchoring phenomena, in- 
dividual differences, situational factors, and the roles of structure 
and affect in judgment. The intensity of topical coverage varies 
inversely with the apparent relevance of the research involved to 
clinical and social judgment. Consequently, there is no single 
organizational principle underlying the various topics covered,
although the authors have addressed themselves to the basic ques-
tion: “. . . What factors in the judgment situation lead to differ-
ences among judges in their ability to discriminate among the be-
havioral information available to them?” (p. viii). In particular,
the authors postulate that this judgmental ability is a function of:
(a) the dimensionality of the stimuli in terms of the number of
dimensions and the number of discriminable intervals along any
one continuum, (b) the type of response alternatives available to a
judge (e.g., nominal vs. ordinal scale), (c) individual differences
with respect to the personal construct systems of the judges, and
finally, (d) situational factors affecting the judgment process.

After the introductory theoretical orientation, which includes a
description of models used to depict the judgment situation (e.g.,
the categorical vs. the dimensional model), the mathematical in-
formation model is reviewed and various information theoretic
measures are mathematically defined as well as given psychologi-
cal interpretation. Operationally defining judgmental discrimin-
ability as amount of transmitted information, the authors explore
the effects on discriminability of such variables as stimulus range
and number of response categories. Unfortunately, few empirical
studies have been conducted in the clinical and social areas using 
an information theoretic approach. However, some research by 
various of the authors of the book is reported in detail, and their
results would indicate that although information transmission in-
creases with stimulus dimensionality for physical stimuli, this is
not the case for behavioral stimuli. In fact, under certain condi-
tions, increased stimulus complexity leads to a decrement in dis-
criminability. The authors speculate that these judg-
mental differences between physical and behavioral stimuli may
be due to the limited nature of the behavioral response system, at
best a unidimensional ordinal scale.
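
As an illustration of the measure on which this part of the book rests, transmitted information can be computed from a table counting how often each response category was used for each stimulus. The sketch below is not taken from the book under review: the function name `transmitted_information` and the two small count tables are hypothetical, and the formula is the standard Shannon expression for the information shared by stimulus and response, in bits.

```python
import math


def transmitted_information(counts):
    """Transmitted information, in bits, from a stimulus-by-response count table.

    counts[i][j] is the number of times response j was given to stimulus i.
    """
    total = sum(sum(row) for row in counts)
    row_totals = [sum(row) for row in counts]
    col_totals = [sum(col) for col in zip(*counts)]
    bits = 0.0
    for i, row in enumerate(counts):
        for j, n in enumerate(row):
            if n > 0:  # empty cells contribute nothing
                joint = n / total
                expected = (row_totals[i] / total) * (col_totals[j] / total)
                bits += joint * math.log2(joint / expected)
    return bits


# Hypothetical judges: one discriminates two stimuli perfectly, one guesses.
perfect_judge = [[10, 0], [0, 10]]
guessing_judge = [[5, 5], [5, 5]]
print(transmitted_information(perfect_judge))
print(transmitted_information(guessing_judge))
```

With two equally frequent stimuli the ceiling is one bit, so the perfect judge transmits one bit and the guessing judge none. Adding response categories raises the ceiling only if the judge can actually use them.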

The authors proceed from information theory to their next
major topic, anchoring effects in judgment, and here again their
indebtedness to the psychophysical literature is apparent. Their
discussion is concerned primarily with the determinants of such
anchoring effects as response variability and judgmental shift to-
ward (assimilation) or away from (contrast) an anchor. The em-
pirical research cited is carefully interwoven with two major the-
oretical approaches: those that focus on the role of scale centering,
such as Adaptation Level Theory, and those focusing on the dis-
tance between the stimulus judged and anchor. On the basis of
psychophysical and social attitude research these two theoretical 
orientations are evaluated, internal and external anchoring phenom- 
ena are examined, and studies are compared for differences in 
methodological procedures as well as in operational definitions of 
anchoring. 

The final chapters of the book are devoted to cognitive-struc-
tural individual differences among judges, situational factors as
determinants of behavioral judgments, and the effect of arousal
on discriminability. These final chapters, although
shorter than the preceding ones, are perhaps the most relevant to
the title of the book because of the greater wealth of clinically and
socially oriented research. Although individual differences are pre-
sented within the broad theoretical contexts of cognitive structures
and dimensional approaches to personological variables, most of
the studies cited by the authors in this context are those employing
the cognitive-stylistic dimension of cognitive complexity-simplic-
ity. The relationships of this variable to such variables as judg-
mental accuracy, confidence, and discriminability of complex stim-
uli are explored. It is unfortunate that research dealing with other
cognitive styles does not receive the same attention given the com- 
plexity-simplicity dimension. It is, however, to the authors’ credit 
that they devote their final chapters to the oft-neglected area of 
situational factors in judgment which includes setting and back- 
ground factors as well as interpersonal situations as determinants 
in the judgment situation. Although the empirical work on situa- 
tional factors as determinants of behavioral judgments is scant, 
the reader is provided a general framework suggestive of future 
areas for profitable research. 

In general it can be said that for each topic covered the research 
presented is well-organized, a reasonable theoretical orientation
is provided the reader, and avenues for future research are care-
fully explored. However, these merits are vitiated by the facts that
the book as a whole is somewhat dull, the topics covered are 
often more relevant to psychophysics than to behavioral judgment, 
and the paucity of clinical and social research in many of the 
areas tends to make a book of this size somewhat premature. The 
major defect of the book lies in the amount of coverage given such 
areas as anchoring phenomena, in which there is a preponderance 
of psychophysical research, some social research, and a striking 
lack of clinical research, at the expense of so little coverage of such 
areas as the application of the Brunswickian Lens model to clinical 
judgment in which clinical judgment studies are rapidly accumulat-
ing. It is also surprising that although the authors are concerned
with individual differences among judges which many of the aver-
aged measures of discriminability tend to obscure, no such concern
is given to the possibility that a unidimensional response continuum 
for complex social stimuli may be obscuring the possibility of more 
than one perceived stimulus dimension. Because of its limited and 
uneven presentation of topics, this book is geared primarily for the 
researcher working in the specific areas covered. The book also may 
be useful to the social or clinical graduate student in search of a 
provocative idea.


Nancy Wiggins
University of Illinois 


The Rorschach in Practice by Theodora Alcock. Philadelphia and
Montreal: J. B. Lippincott Company, 1963. Pp. xii + 252.


The Rorschach in Practice was written in response to the re-
quest of the author’s colleagues at the Tavistock Institute of Hu-
man Relations [...] unpublicized British developments in the use
of the test.” Eric Trist says in his foreword: “We have wanted to
preserve, undistorted, (Theodora Alcock's) practice with the
Rorschach as a [...]

[...] with fine praise. She uses all his [...] with some [...] and
recommend[...] [...]ington’s statement [...] to quote twice: “The
perceiving mind and the [...] total [...] perceptual technique. This
[...] elaborated further. In the elements of the testing situation, she finds it
possible to discern the dynamic interaction “between the individ-
ual perceiving mind and the perceived world from which personal-
ity has been evolved, and is still in the process of [...]
with all the complexities of past experience, of phantasies, 
of defence systems conscious and unconscious, as they are reflected 
in the current situation.” To interpret this dynamic interaction, 
Miss Alcock uses a psychoanalytic frame of reference. In this re- 
spect she differs from Bruno Klopfer, who relies more heavily on 
a Jungian framework. Yet one has reason to believe that both 
clinicians would arrive at essentially similar interpretations of a 
subject’s personality structure. They would use different clues. 

The volume is divided into three parts. Part I, containing only 
100 pages, achieves the difficult task of giving clear expositions 
of the rationale of the scoring system, the method of analyzing the 
projected personality, and an introduction to differential diagno- 
sis. Part II contains nine diagnostic studies through which the 
student can follow the intricate steps by which Miss Alcock, as 
diagnostician, sets up hypotheses, checks them with the data ob- 
tained from formal and sequence analyses, and only then arrives 
at an interpretation. The cases presented cover conflict and de-
fense within a healthy personality; indications for or against psy-
chotherapy; psychotic records; two cases of intracranial damage;
and one on educational disability in an intelligent boy. These 
are a veritable goldmine for the instructor, particularly for one 
with a psychoanalytic orientation. All of the cases were followed
[...] these patients, who are compared with three control groups. The
bibliography of approximately 500 items, of which no more than
50 [...] the study of this technique. Part II is particularly valuable for
teaching and developing the art of diagnosis. The book does not 
recommend itself as a basic text, because it is not so rigorously 
edited as one would wish for the beginning graduate student, nor 
so detailed as necessary for the advanced student. Scoring [...]
times inconsistent with previous instructions; for example, in the
use of F per cent as compared with F-plus per cent. One [...] aware
that the contents were written over a period of time, and that all
[...] insight into the Rorschach method of diagnosis, and, of course,
to the advanced Rorschach student. In reading this relatively slim 
and well-printed book, one can turn visitor to the Tavistock In- 
stitute of Human Relations and its Clinic, and as such, gladly 
overlook inadequacies while allowing oneself to enjoy the rich 
contributions placed before him, as well as the minor pleasures of 
savoring the British flavor provided by learning that the patient 
had lost “one stone” (fourteen pounds) or that the sex of an H per- 
cept was scored as “innominate” (unidentified). 

This is a friendly book. One easily recognizes that Miss Alcock
is a gifted teacher and clinician. She makes the obscure become
self-evident.


FLORENCE DIAMOND 
Pasadena, California 


Role Conception and Group Consensus: A Study of Disharmony
in Hospital Work Groups by J. Eugene Haas. Columbus, Ohio:
Bureau of Business Research, The Ohio State University, 1964. 
Pp. xiv + 138. 

Role Conception and Vocational Success: A Study of Student and
Professional Nurses by Marvin J. Taves, Ronald G. Corwin,
and J. Eugene Haas. Columbus, Ohio: Bureau of Business Re-
search, The Ohio State University, 1963. Pp. xiv + 120.


These investigations have in common the study of the influence 
of the role concept. Although the copyright date of the first mono- 
graph is later than the second, it is really a broad introduction to 
the theory common to both studies. In the 1964 monograph the
focus is on the group: the relation of role conceptions
to harmony in work groups. In the 1963 monograph it is on the
individual: his concept of his role in relation to his success and
satisfaction in his occupational setting. The former monograph
(1964) makes use of role conceptions in several hospital groups,
and the latter one (1963) deals with role conceptions of student
and professional nurses as individuals. In both cases these might
just as well have been other kinds of organizations and persons 
pursuing other types of work. The main contributions of the studies 
are in clarifying the nature and importance of role conception to 
group and individual functioning in work situations, and in de-
veloping means of measuring role conceptions [...] concepts
utilized in these [...].

First, consider Role Conception and Group Consensus. [...]
three chapters review the literature [...]
This review of literature in relation to the development and refine-
ment of role theory sets the framework for the research design.

The principal hypothesis seems to be that the level of role con-
sensus is related to the level of performance and to the level of
sociometric preference among members of the organization. A low
level of consensus, it was assumed, would result in irritations, low
performance, and low sociometric preferences within the group.
There is a closely knit set of hypotheses, and a research design
is provided for testing each of them in turn.

Hospital work groups were used to test the elements of the 
theory. Such groups were available and the need for this kind of 
research among these groups appeared to be evident. However, the 
study could have been conducted in almost any voluntary group 
or organization such as those in business, industry, teaching, or 
social work with a probability of useful benefits. 

The instruments used for the collection of data are described in 
detail. These include a role concept inventory, a performance rat-
ing chart, a hospital sociometric questionnaire, and a controlled
interview. Complete copies of all instruments, except the interview
procedure, are reproduced in the appendix. The interview method
is described on pages 54-64.

In general, the hypotheses were supported by the data in this 
study of hospital work groups. The author is careful to point out 
elements of hypotheses which are not upheld and shows caution in 
his interpretation and generalization of the findings. There is a 
need to test these elements of role theory in other groups. Such a 
study might point the way to better human relationships and to 
higher productivity in almost any group if it led to the clarifying 
of roles and if consensus were used to attain such a clarification. 

The second monograph, Role Conception and Vocational Suc- 
cess, applies role theory to the individual. The lives of most per- 
sons are organized around their occupations. It is assumed in this 
study that the individual’s image of his occupation and his con- 
ception of his work role have much to do with the nature of his 
broader self-conception. If the image of the occupation is favorable 
and if the individual perceives the relationship of this occupation 
to his own goals and values, it is hypothesized that these factors 
will lead to feelings of job satisfaction and success. These feelings 
will be further enhanced if his self-conceptions are consistently 
confirmed by the actions of his peers and supervisors. This study 
was designed to test the application of these and related theoretical 
constructs from role theory and self theory. In the words of the 
authors of this report: 


“. . . clarity of self-conception, self-esteem, and career goals
are considered to be intervening variables which relate (a)
work-group consensus and occupational image on the one hand
with (b) satisfactory and successful role enactment on the 
other.” (p. 16) 


The study is based on two samples of nurses, namely, student 
nurses in four schools and nursing personnel from typical medical 
and surgical stations in three hospitals. Involved were 406 student 
nurses and 156 members of the hospital staffs, representing a cross-
section of the nursing personnel.

Eight instruments were used to gather data to test the assump- 
tions and hypotheses set forth in the plan of the study. These are: 


Image of Nursing Scale (INS)
Comparing Vocations Inventory (CVI)
Hospital Station Role-Conception Inventory (HSRCI)
Student Nurse Role-Conception Inventory (SNRCI)
Performance Rating Charts
    1. General Duty Nurses - on the Station
    2. Student Nurses - in the Classroom
Bullock Job Satisfaction Scale
Modified Bullock Job Classification Scale to Measure Satisfaction
    with Educational Experience


These instruments are reproduced completely in eight appendices 
and are described in some detail on pages 18-23. 

The evidence and its interpretation are organized into two
sections, one dealing with student nurses and the other with hospi-
tal personnel. Statistical data are given to support conclusions. It
seems that the educational success of student nurses is related
significantly to their image of nursing, their role consensus with
supervisors, and their satisfaction with their educational experi-
ences. There is a low and inconsistent association between their
conception of nursing and certain of the success ratings. Stu-
dents of average success, more than either the high or low success
students, tend to identify with the peer group and official
supervisors rather than with abstract professional standards.

For general duty nurses, satisfaction with their work is related
significantly but at a low level with their role conception and fa-
vorableness of their image of nursing. If official position rather
than ratings by supervisors is used as a measure of success, then
[...] and success shows a positive relationship.

The [...], measurement specialist, and researcher will
[...] suggestive. The careful analysis of the problem,
the development and assembly of needed appraisal instruments,
the copious footnotes and bibliography, as well as the analysis of
the data may contribute to the work of others dealing with re-
lated problems. There are broad implications both for vocational
guidance and for on-the-job training which might be applicable
in industrial, educational, and service organizations.

So far as the specific findings of these two monographs are con- 
cerned, they will be found especially helpful for hospital adminis- 
trators and boards, nursing personnel, and others who are in any 
way connected with hospital and nursing occupations. So far as 
the broader problems of role theory, self-conception, role concep- 
tion, group consensus, and research design in the social sciences are 
concerned, these studies will be of interest to the sociologist and 
social scientist. The testing of certain aspects of role theory by a 
scientific approach and the implications of these studies for similar 
approaches to unrest, low production, and dissatisfaction in or- 
ganizations and among persons in various occupational settings 
should make these two monographs of interest to psychologists and 
industrial personnel as well as to social scientists. 


CHESTER O. MATHEWS
University of California, Santa Barbara 


Dimensions of Psychotherapy by D. R. Stieper and D. N. Wiener. 
Chicago: Aldine Press, 1965. Pp. viii + 180. $5.95.


Subtitled “an experimental and clinical approach,” this little vol-
ume concerns itself with the question “What makes therapy run?”
Or as the dust jacket has it: “What factors common to all types of 
therapy are most significant in producing change in patients?” 

The book in a business-like and concise manner singles out: (1) 
the therapist and his characteristics, (2) the patient and his char- 
acteristics, (3) the therapeutic process or the products of interac- 
tion between the two. It then goes on with a chapter on definition 
and research, a review of the research literature, the patient and 
his problems, the therapist, and the interaction. Chapter eight con- 
tains the conclusions. 

The conclusions turn out to be five factors which, the authors
say, are basic to patient change. They are (a) effective therapist
role, (b) shaping motivation, (c) setting effective goals, (d) utiliz-
ing historical material, and (e) maximizing conditions for learn-
ing. The last chapter therefore makes an excellent review of the
conditions for effective therapy. It is useful to
any practitioner, therapist and counselor alike.

This method of integrating the various schools of therapy is an 
ingenious one, and offers advantages not found in other ap- 
proaches. Since many writers currently feel the need of producing 
a “general field theory” covering all the therapies, such a book
represents a useful addition to the therapist’s library. It is perhaps
not too clearly written in spots, and is subject to other criticism of
efforts to slice up the pie in a new manner. The “factors” are
empirically derived with little evidence that it is they rather than
other aspects which produce client change. However, in a disci- 
pline groping for structure, one needs hypotheses, and hypotheses, 
like people, are better judged by their consequences than by their 
antecedents. 


John C. Gowan
San Fernando Valley State College 


In the Autumn 1966 issue, a reference and an illustration in 
one of the articles may inadvertently have conveyed a false impres- 
sion. The illustration on pages 744 and 745 and the last sentence 
of the second paragraph on the preceding page might permit a 
reader to infer that test users may undertake to make their own 
answer sheets for published and copyrighted tests such as the 
Differential Aptitude Tests. This is not the case. The answer sheet 
(or card or other answer media) involved in the active test-taking 
behavior of the examinee is an integral part of the test itself. In 
the case of copyrighted tests, special revisions, changes, adaptations, 
or local production of answer sheets for any reason may be under- 
taken only with the written consent of the publisher. 


ERRATA 


In the Autumn 1966 issue of this journal the article “Tension in 
Freshman and Senior Engineering Students” by Ben H. Romine, Jr. 
and W. Scott Gehman contained an error which the authors wish 
corrected. On page 566 Endler, Hunt, and Rosenstein’s Stimulus-
Response Inventory of Anxiousness was mistakenly referred to as
the Stimulus-Response Inventory of Anxiousness.

In the article “Note on Rank Biserial Correlation" by Gene V 
Glass which also appeared in the Autumn 1966 issue the author has 
discovered that bars were omitted from over the X and Y in the 
ninth and tenth lines from the bottom of page 625 as well as from 
the Y's and X's in the last and next to last lines of the same page. 
On page 628 bars were also omitted from over the Y's in the tenth 
and eleventh lines from the bottom. 


EDUCATIONAL AND PSYCHOLOGICAL MEASUREMENT 
1967, 27, 253-255. 


EXPANSION OF SIMILARITY ANALYSIS BY RECIPROCAL 
PAIRS FOR DISCRETE AND CONTINUOUS DATA 


LOUIS L. McQUITTY 
Michigan State University 


Critique 

All reciprocal pairs of any matrix could have been operated on
in preparing a new matrix. For example, Matrix 2 contains two re- 
ciprocal pairs, B-CD and E-F. Each of these could have been com- 
bined to yield a final matrix directly from Matrix 2 without first 
producing Matrix 3. 

A crucial problem in theory is, however, introduced if we do not 
proceed through all stages. How do we compute the entry for Row 
BCD, Column EF of the final matrix from the entries of Matrix 2? 
Earlier when we computed it from Matrix 3, we solved for: 


$$a_{BCD\text{-}EF} = \frac{a_{BCD\text{-}E} + a_{BCD\text{-}F}}{2} = 18.5, \text{ rounded to } 19 \tag{5}$$


In Equation 5, the individual types E and F, which are being
joined to form a revised type, are each given a weight of ½ as com-
pared with the Type BCD which was formed in an earlier stage;
Type BCD is given a weight of one.

In more general terms, in computing the association between a
type just being formed with one previously formed, each of the two
typal representatives of the new type is given a weight of ½ as com-
pared with a weight of one for the previous type. This means, how-
ever, that each, the new type and the previous type, is given a
weight of one in determining the relation between them and this is
true irrespective of the number of individuals in each type. Conse-
quently, the amount of weight given any one individual in comput-
ing the association between types varies with two considerations:
(a) the number of individuals in the type, and (b) the sequential
order in which a member enters a type, the weight given an individ-
ual in entering a type increasing the later he enters. If a representa-
tive type of four individuals joins an individual type, then the lat-
ter individual is given as much weight as the four individuals.

The above method of weighting is consistent with the theory out
of which the method was developed, viz., the common characteris-
tics of n members of a type are a better representative of that type
than are the characteristics of any one of them. If a developing type
must be described in terms of n individuals before it and a single
individual are reciprocal in their likeness, then that single individ-
ual is as good a representative as are the n other individuals jointly.
They play equal roles in producing the reciprocity.

Both the above theory and successes of this method of classifica- 
tion have some interesting connotations; they suggest that the 
strong personality, and perhaps the mentally secure person, tend to 
be more individualistic; they stand relatively alone and classify 
with others in terms of only very commonly held characteristics. 

In an alternative approach, we can derive a final matrix analo-
gous to Matrix 4 directly from Matrices 1 and 2 without first pro-
ducing Matrix 3. It will, however, weight individuals differently
from the fashion just outlined. Each individual's weight will equal
one over the number of individuals in his type and will be inde-
pendent of the sequential order in which the individuals entered the
type. Each of the two types being joined will have equal weight.

In the above approach:

$$a_{BCD\text{-}EF} = \frac{a_{EB} + a_{EC} + a_{ED} + a_{FB} + a_{FC} + a_{FD}}{6} = 19\tfrac{5}{6}, \text{ rounded to } 20 \tag{6}$$

The first three terms of the numerator are from the first three
cells of Row E and the next three terms are from the first three
cells of Row F, both thus from Columns B, C, and D of Matrix 1.
The final matrix by this solution is as shown in Table 3.


TABLE 3
A Final Matrix Directly from Matrix 2

In general notation, Equation 6 becomes:

$$a_{BCD\text{-}EF} = \frac{a_{EB} + a_{EC} + a_{ED} + a_{FB} + a_{FC} + a_{FD}}{nm} \tag{7}$$

when n is the number of individuals in one type and m is the num-
ber in the other type.
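The entire equation can be generalized to:

$$a_{x\text{-}y} = \frac{1}{nm} \sum_{i=1}^{n} \sum_{j=1}^{m} S_{x_i y_j}$$

$S_{x_i y_j}$ = the original similarity index.
x = Type x represented by individuals $x_1, x_2, x_3, \ldots, x_n$.
y = Type y represented by individuals $y_1, y_2, y_3, \ldots, y_m$.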
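
The two weighting schemes discussed in this critique can be set side by side in a short sketch. The similarity indices below are hypothetical and are not the entries of Matrix 1. The function names are likewise illustrative, and the order of joining (C with D, then B with CD, then E with F) is an assumption suggested by the reciprocal pairs B-CD and E-F. The stage-wise rule gives each of the two parts being joined a weight of ½; the pooled rule averages the original indices over all n × m cross-type pairs.

```python
# Hypothetical similarity indices between individuals (not the values of Matrix 1).
SIM = {
    "E": {"B": 24, "C": 16, "D": 20},
    "F": {"B": 22, "C": 18, "D": 14},
}


def pooled_association(sim, type_x, type_y):
    """Mean of the original indices over all n * m cross-type pairs."""
    total = sum(sim[x][y] for x in type_x for y in type_y)
    return total / (len(type_x) * len(type_y))


def join(first, second):
    """Stage-wise rule: the two parts being joined each get a weight of 1/2."""
    return (first + second) / 2


def stagewise_association(sim):
    """Assumes C joined D first, then B joined CD, then E joined F."""
    with_bcd = {}
    for person in ("E", "F"):
        with_cd = join(sim[person]["C"], sim[person]["D"])
        with_bcd[person] = join(sim[person]["B"], with_cd)
    return join(with_bcd["E"], with_bcd["F"])


print(pooled_association(SIM, ["E", "F"], ["B", "C", "D"]))  # 19.0
print(stagewise_association(SIM))  # 20.0
```

Under the stage-wise rule B, the last to enter Type BCD, carries a weight of ½ while C and D carry ¼ each, which is why the two results differ whenever B's indices differ from the average of C's and D's.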
A NOTE ON THE ANALYSIS OF GAINS
AND POSTTEST SCORES 


MAX D. ENGELHART 
Duke University 


Where pre- and posttest yield comparable scores justifying com-
putation of gains and the subjects in two groups are paired on the
basis of pretest scores, the matched groups formula yields the same 
t whether posttest scores or gains are analyzed. Where the pre- and 
posttest yield comparable scores and the groups are not initially 
equivalent, analysis of covariance yields the same F and identical 
adjusted differences between posttest means and mean gains. All of 
this is true even when the correlation between pretest scores and 
gains is negative. 
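
The first claim can be checked numerically with a minimal sketch. The scores below are hypothetical, and the function name `matched_pairs_t` is illustrative. Note that the sketch does not use the matched groups formula itself; it uses the standard deviation of the individual pair differences divided by the square root of one less than the number of pairs. Because each pair shares one pretest score, the difference between two gains equals the difference between the two posttest scores, so the two analyses cannot differ.

```python
import math
import statistics


def matched_pairs_t(first, second):
    """t from the SD of pair differences divided by sqrt(pairs - 1)."""
    diffs = [a - b for a, b in zip(first, second)]
    std_error = statistics.pstdev(diffs) / math.sqrt(len(diffs) - 1)
    return statistics.mean(diffs) / std_error


# Hypothetical data: each pair of subjects has the same pretest score.
pretest = [10, 12, 15, 18, 20]
posttest_1 = [14, 15, 21, 22, 27]
posttest_2 = [12, 15, 17, 21, 22]
gains_1 = [y - x for x, y in zip(pretest, posttest_1)]
gains_2 = [y - x for x, y in zip(pretest, posttest_2)]

t_scores = matched_pairs_t(posttest_1, posttest_2)
t_gains = matched_pairs_t(gains_1, gains_2)
print(t_scores, t_gains)
print(t_scores == t_gains)  # True
```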

In view of the problems confronted in analyzing experimental 
data involving growth or gains, it seems strange that, so far as this 
author knows, no text or other publication explains what is stated 
in the paragraph above. When subjects are paired on a pretest
which yields scores comparable to those of the posttest, to prove
that the formulas

$$t = \frac{\bar{Y}_1 - \bar{Y}_2}{\sqrt{\dfrac{\sum y^2 \,(1 - r_{xy}^2)}{[\ldots]}}} \tag{1}$$

$$t = \frac{\bar{G}_1 - \bar{G}_2}{\sqrt{\dfrac{\sum g^2 \,(1 - r_{xg}^2)}{[\ldots]}}} \tag{2}$$

give the same t, it suffices to show that the numerators are equal
and that $\sum y^2 (1 - r_{xy}^2)$ equals $\sum g^2 (1 - r_{xg}^2)$. The correlations
should be within groups correlations.
Because of pairing on the pretest, $\bar{X}_1 = \bar{X}_2$. Hence, [...]

The expressions just shown to be equivalent, whether written in
terms of scores or gains, are the same as those used in the analysis 
of covariance to obtain the adjusted sum of squares for total and the 
adjusted sum of squares for within groups using the appropriate 
sums of squares and products. Given equal adjusted sums of 
squares for total and equal adjusted sums of squares for within
groups, it is obvious that the reduced sums of squares for between 
groups are also equal, whether based on scores or gains. Hence, the 
same F is obtained. 
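
The equality of the two F ratios can be illustrated with a small sketch, assuming a single covariate (the pretest) and a one-way layout. The data and the function names `deviation_sums` and `ancova_f` are hypothetical. The adjusted sums of squares are computed as the sum of squares of the criterion less the squared sum of products divided by the sum of squares of the pretest, once for total and once for within groups.

```python
import math


def deviation_sums(x, y):
    """Sums of squares and products of x and y about their own means."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxx, sxy, syy


def ancova_f(groups):
    """F for adjusted group differences; groups is a list of (pretest, criterion)."""
    all_x = [v for x, _ in groups for v in x]
    all_y = [v for _, y in groups for v in y]
    txx, txy, tyy = deviation_sums(all_x, all_y)
    wxx = wxy = wyy = 0.0
    for x, y in groups:
        sxx, sxy, syy = deviation_sums(x, y)
        wxx, wxy, wyy = wxx + sxx, wxy + sxy, wyy + syy
    adjusted_total = tyy - txy ** 2 / txx
    adjusted_within = wyy - wxy ** 2 / wxx
    k, n = len(groups), len(all_x)
    between = (adjusted_total - adjusted_within) / (k - 1)
    return between / (adjusted_within / (n - k - 1))


# Hypothetical groups that are not equivalent on the pretest.
pre_1, post_1 = [8, 10, 12, 14, 16], [12, 13, 17, 18, 22]
pre_2, post_2 = [11, 13, 15, 17, 19], [13, 16, 17, 21, 22]
gain_1 = [y - x for x, y in zip(pre_1, post_1)]
gain_2 = [y - x for x, y in zip(pre_2, post_2)]

f_posttest = ancova_f([(pre_1, post_1), (pre_2, post_2)])
f_gains = ancova_f([(pre_1, gain_1), (pre_2, gain_2)])
print(f_posttest, f_gains)
print(math.isclose(f_posttest, f_gains, rel_tol=1e-9))  # True
```

The two values agree only up to floating-point rounding, which is why the comparison uses a tolerance rather than exact equality.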

Differences in adjusted posttest means can be shown algebraically
to be equal to differences in adjusted mean gains from
group to group by comparing the results of such equations as

$\bar{Y}'_j = \bar{Y}_j - b_{xy}(\bar{X}_j - \bar{X})$

$\bar{G}'_j = \bar{G}_j - b_{xg}(\bar{X}_j - \bar{X})$.

When these equations are expressed in deviate measures, not only
are the differences in adjusted posttest means equal to differences in
adjusted mean gains, but the adjusted posttest means when expressed
as deviate measures actually equal the adjusted mean gains also
expressed as deviate measures.


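In deviate measures, for any group j,

$\bar{y}' = \bar{y} - b_{xy}\bar{x}$ and $\bar{g}' = \bar{g} - b_{xg}\bar{x}$,

since the general mean $\bar{X}$ is zero in deviate measures. Also,

$b_{xg} = b_{xy} - 1$ and $\bar{g} = \bar{y} - \bar{x}$.

Hence,

$\bar{g}' = \bar{y} - \bar{x} - (b_{xy} - 1)\bar{x}$.

The last expression simplifies to

$\bar{y} - b_{xy}\bar{x} = \bar{y}'$.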
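
The identity of adjusted posttest means and adjusted mean gains in deviate measures can be illustrated with a small computation. This is a sketch under stated assumptions: the two groups and their scores are invented, the slope is taken as the pooled within-groups regression coefficient, and all function names are hypothetical.

```python
# Hypothetical (pretest, posttest) scores for two groups, invented for illustration.
groups = {
    "A": [(10.0, 14.0), (12.0, 15.0), (15.0, 20.0), (11.0, 13.0)],
    "B": [(14.0, 15.0), (16.0, 19.0), (13.0, 13.0), (18.0, 22.0)],
}


def mean(values):
    return sum(values) / len(values)


def within_slope(data, pick):
    """Pooled within-groups slope of pick(pre, post) regressed on the pretest."""
    sxx = sxv = 0.0
    for pairs in data.values():
        xs = [pre for pre, _ in pairs]
        vs = [pick(pre, post) for pre, post in pairs]
        mx, mv = mean(xs), mean(vs)
        sxx += sum((a - mx) ** 2 for a in xs)
        sxv += sum((a - mx) * (b - mv) for a, b in zip(xs, vs))
    return sxv / sxx


def adjusted_deviate_means(data, pick):
    """Adjusted group means, expressed as deviations from the general mean."""
    b = within_slope(data, pick)
    all_x = [pre for pairs in data.values() for pre, _ in pairs]
    all_v = [pick(pre, post) for pairs in data.values() for pre, post in pairs]
    gx, gv = mean(all_x), mean(all_v)
    out = {}
    for name, pairs in data.items():
        x_dev = mean([pre for pre, _ in pairs]) - gx
        v_dev = mean([pick(pre, post) for pre, post in pairs]) - gv
        out[name] = v_dev - b * x_dev
    return out


def posttest(pre, post):
    return post


def gain(pre, post):
    return post - pre


slope_post = within_slope(groups, posttest)
slope_gain = within_slope(groups, gain)  # equals slope_post - 1
adj_post = adjusted_deviate_means(groups, posttest)
adj_gain = adjusted_deviate_means(groups, gain)
print(adj_post, adj_gain)
```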

Given two groups equated and tested as earlier described, the
standard error of the difference between posttest means or mean
gains may be obtained by computing the standard deviation of the
individual differences between the posttest scores or between the
individual gains of each pair of subjects, since these are equal, and
dividing this standard deviation by the square root of one less than the
number of pairs. This standard error is, of course, equal to the one
obtained by either of the formulas:


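$\sqrt{S_{\bar{Y}_1}^2 + S_{\bar{Y}_2}^2 - 2 r_{Y_1 Y_2} S_{\bar{Y}_1} S_{\bar{Y}_2}}$ or $\sqrt{S_{\bar{G}_1}^2 + S_{\bar{G}_2}^2 - 2 r_{G_1 G_2} S_{\bar{G}_1} S_{\bar{G}_2}}$,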
where the correlations are between the posttest scores or between 
the gains of the paired subjects. When the correlations referred to 
can be wholly attributed to what the paired subjects have in com- 
mon at the start of the experiment and is measured by means of the 
pretest, and when a single determination is made of the variance 
within the groups of the posttest scores or of the gains, then it 
can be shown that $r_{Y_1 Y_2} = r_{xy}^2$ and $r_{G_1 G_2} = r_{xg}^2$ and the matched
groups formulas earlier listed will give the same t as the formulas
just referred to except for a slight discrepancy due to division by 
one less degree of freedom. Since the correlations used in the 


1 Where the standard error is obtained from the standard deviation of the 
individual differences, one enters a table of t with the degrees of freedom as
one less than the number of pairs. Since this standard error equals the stand-
ard error obtained through use of the other formulas given above, it is argued
that the number of degrees of freedom is also one less than the number of
pairs. How is this to be reconciled with N1 + N2 - 3 as the number of degrees
of freedom used with the matched group formula and with the adjusted 
within groups variance of the analysis of covariance of two groups? 


matched groups formulas earlier listed are within groups correla-
tions the value of t obtained, when squared, equals the F which
would be obtained if analysis of covariance were applied to the pre-
and posttest scores or pretest scores and gains of the equivalent
groups. Given the limiting conditions outlined above, analysis of
scores or gains should in each case yield the same t whose square
is equal to the same F. Where pre- and posttest yield comparable
scores and the groups are nonequivalent, as earlier explained, co-
variance analysis yields the same F and the same adjusted dif-
ferences between groups whether pre- and posttest scores or pretest
scores and gains are analyzed. 
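
The agreement between the standard error computed from the individual pair differences and the one computed from the correlated-means formula can be verified directly. The sketch below uses invented posttest scores for five matched pairs; standard deviations are taken with divisor n and standard errors of means as SD divided by the square root of n - 1, which is an assumption about the convention intended.

```python
import math

# Hypothetical posttest scores for five matched pairs, invented for illustration.
group1 = [22.0, 25.0, 19.0, 30.0, 27.0]
group2 = [20.0, 26.0, 15.0, 27.0, 24.0]
n = len(group1)


def mean(values):
    return sum(values) / len(values)


def sd(values):
    """Standard deviation with divisor n (not n - 1)."""
    m = mean(values)
    return math.sqrt(sum((v - m) ** 2 for v in values) / len(values))


def correlation(a, b):
    """Product-moment correlation between two equally long lists."""
    ma, mb = mean(a), mean(b)
    cov = sum((u - ma) * (v - mb) for u, v in zip(a, b)) / len(a)
    return cov / (sd(a) * sd(b))


# Route 1: standard deviation of the pair differences over sqrt(n - 1).
diffs = [u - v for u, v in zip(group1, group2)]
se_from_differences = sd(diffs) / math.sqrt(n - 1)

# Route 2: standard errors of the two means combined with their correlation.
se1 = sd(group1) / math.sqrt(n - 1)
se2 = sd(group2) / math.sqrt(n - 1)
r = correlation(group1, group2)
se_from_formula = math.sqrt(se1 ** 2 + se2 ** 2 - 2 * r * se1 * se2)

print(se_from_differences, se_from_formula)
```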
A NOTE ON TEST ITEM DIFFICULTY¹


HAROLD A. EDGERTON AND PETER H. STOLOFF
Performance Research, Inc., and the University of Maryland 


In reviewing the construction and use of the Science Aptitude Ex- 
aminations used in the selection process of the Annual Science Tal- 
ent Search for the Westinghouse Science Scholarship and Awards, 
some improved methods of maintaining the high difficulty level of 
the examinations seem desirable.

The Science Aptitude Examinations have relied primarily on 
standard type multiple choice questions having four or five choices 
with one answer correct, and all others wrong. Some of the more 
difficult questions have been based on remote or obscure facts; oth- 
ers have required so fine a discrimination between the right an- 
swer and other answers which are almost, but not quite, right that 
there have been comments by the test item critics that the differences
between right and wrong answers are too small to carry good a priori
validity.

In most years, the frequency distribution of Science Aptitude 
Examination² scores for girls has shown a positive skewness, but for
boys the distribution usually has shown a slightly negative skew- 
ness. 

Since the major use of test scores involves discrimination in the 


1 The authors wish to thank Science Service Inc., for making data available
and the Computer Service Center of the University of Maryland for the use
of their facilities in the analysis of the data.
2 An entirely new form of the Science Aptitude Examination is prepared for
each year; twenty-four have been used and later released to the public, and
the twenty-fifth is now in press for use in the 25th Annual Science Talent
Search.

upper range of scores, positive skewness is desirable. Considerable
attention has been given to ways of obtaining the needed difficulty 
and its consequent forms of frequency distribution of scores. 

Ways of increasing the difficulty of test questions were reviewed. 
Just to write questions of the same type but of greater difficulty did 
not seem to be the answer. That had been tried, but the contestants 
in the Annual Science Talent Search are an unusually capable 
group. 

Then came the notion that the concept of compound probability 
could be useful. For example: if a question has five answer alterna- 
tives of equal difficulty, the probability of getting the right answer 
is .20. If two of the answers are correct, the a priori probability of 
selecting the two correct answers is .04 (= .2 × .2). Even if the real
a priori probability of marking each of 1, 2, 3 or 4 right answers is 
high, the probability of marking both of the two correct answers, or 
the 3 or the 4 correct answers will be less than that of marking only one
right answer. The difficulty of a question is the empirically observed 
probability of selecting exactly the right answers. 

Presenting a set of multiple choice questions in which some
questions have one correct answer, some have two correct answers,
etc., does not set a particularly new task. The scoring formula is
more unusual; a question receives a score of 1 when all correct an-
swers and no others are marked as correct; otherwise the score is 0.
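
The all-or-nothing scoring rule just stated can be sketched as follows. The function names and the lettered alternatives are hypothetical; the count of right and wrong answers is included only for comparison with the question score and is an assumption about how individual answers would be tallied.

```python
def question_score(marked, correct):
    """Return 1 only when exactly the correct alternatives are marked, else 0."""
    return 1 if set(marked) == set(correct) else 0


def answer_counts(marked, correct):
    """Return (number of correct alternatives marked, number of wrong ones marked)."""
    marked, correct = set(marked), set(correct)
    return len(marked & correct), len(marked - correct)


# Hypothetical five-choice question whose correct alternatives are b and d.
key = {"b", "d"}
print(question_score({"b", "d"}, key))       # both right answers and no others
print(question_score({"b"}, key))            # one right answer omitted
print(question_score({"b", "d", "e"}, key))  # an extra wrong answer marked
print(answer_counts({"b", "d", "e"}, key))
```

Under this rule a subject who marks both right answers plus one wrong alternative earns no credit for the question, although two of the marked answers are right.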

This organization of questions and answers may elicit some 
negative reaction because it is not widely used. It does require extra 
care in preparation of answer alternatives so that the task of dis-
criminating rightness and wrongness of each alternative is a feasible
one for the test subjects. This form of question reduces the likeli- 
hood that test-wise subjects can determine the correct answer by 
cues extraneous to the purposes of the test. 

Six questions of this type were included in Part A of the Science 
Aptitude Examination for the 23rd Annual Science Talent Search. 
To see how the new item form functioned, the answer sheets for the 
STS participants from Ohio and Wisconsin (excluding those who 
won honors) were used. The Ohio sample produced 151 answer 
sheets, and the Wisconsin sample included 51, a total of 202 for
this study.


The six test questions had a total of seventeen right answers out 
of a possible twenty-four. 


Table 1 summarizes the data for these six test questions:


TABLE 1
Difficulty Levels of Selected Test Items

            No. Rt.       Difficulty
Question    Answers     Item    Answer
A              2         .78      .82
B              4         .29      .70
C              3         .51      .81
D              3         .28      .65
E              2         .45      .61
F              3         .21      .60


The difficulty data show that for question A, having two right 
answers, 78 per cent marked both right answers and no others. How- 
ever, each of the two correct answers was marked by an average of 
82 per cent of the contestants. The difference between the difficulty of getting
the question answered entirely correctly and the average difficulty
of the correct answers shows rather clearly for questions B, C, D, E
and F.
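
The compound-probability idea can be set beside the Table 1 figures. The sketch below raises the average answer difficulty to the power of the number of right answers, which assumes that the right answers are marked independently and ignores wrong alternatives being marked; that assumption is made here for illustration and is not a claim from the study. The function name is hypothetical.

```python
# (question, number of right answers, item difficulty, average answer difficulty),
# copied from Table 1.
TABLE_1 = [
    ("A", 2, 0.78, 0.82),
    ("B", 4, 0.29, 0.70),
    ("C", 3, 0.51, 0.81),
    ("D", 3, 0.28, 0.65),
    ("E", 2, 0.45, 0.61),
    ("F", 3, 0.21, 0.60),
]


def product_rule(answer_difficulty, n_right):
    """Probability of marking all right answers if each is marked independently."""
    return answer_difficulty ** n_right


for question, n_right, item_p, answer_p in TABLE_1:
    predicted = product_rule(answer_p, n_right)
    print(f"{question}: observed item difficulty {item_p:.2f}, "
          f"independence prediction {predicted:.2f}")
```

Where the observed item difficulty exceeds the prediction, success on one right answer of a question goes together with success on the others more often than independence would imply.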

The correlation between the two sets of scores is only 0.80. The 
correlation of number of questions right and number of answers 
right was 0.85 and 0.93 respectively with number of answers right 
minus number of answers wrong. These evidences indicate that 
the questions right score is not the same as a more usual score de- 
rived from the answers without reference to the questions to which 
they belong.

As shown in Figure 1, the distribution of scores on these six ques-
tions for answers correct shows a more marked skewness than did
the scores for questions right. The distribution of scores for ques- 
tions correct produced a slight positive skewness, but the answers 
correct distributions (both rights and right minus wrongs) are nega- 
tively skewed. 
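
For readers who wish to check the direction of skewness in a set of scores, a moment coefficient can be computed as sketched below. The study does not state which skewness index was used, so the third-moment coefficient is an assumption, and the score lists are invented.

```python
def skewness(scores):
    """Third-moment coefficient of skewness: m3 / m2 ** 1.5."""
    n = len(scores)
    m = sum(scores) / n
    m2 = sum((s - m) ** 2 for s in scores) / n
    m3 = sum((s - m) ** 3 for s in scores) / n
    return m3 / m2 ** 1.5


# Invented score lists: most subjects low with a few high scores, and the reverse.
mostly_low = [0, 0, 0, 1, 1, 2, 6]
mostly_high = [6 - s for s in mostly_low]
print(skewness(mostly_low))   # positive: long tail toward high scores
print(skewness(mostly_high))  # negative: long tail toward low scores
```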

The concept of multiple answers and multiple choice questions 
was again used in the 24th Annual Science Talent Search. Part A³ of
the Science Aptitude Examination contained 45 questions, 31 of 


3 Part A contains independent five-choice multiple choice questions cover-
ing science information and problems. Part B contains paragraphs of informa- 
tion drawn from science publications, each followed by two or more five-choice 
multiple choice questions based on the material in the paragraph. 
Figure 1. Frequency distribution of scores on six selected questions.
(N = 202) 


which had two or more correct answers, a total of 99 possible cor- 
rect answers. 

Part B contained 51 five-choice multiple choice questions; of
these, 25 had more than one correct answer, for a total of 89 correct an-
swers to the 51 questions.

The test records for the 114 Ohio nonhonor participants in the 
24th Annual Science Talent Search were examined. 

For each part two scores were obtained: number of questions an- 
swered entirely correctly and number of answers marked correctly. 
Table 2 shows for the Ohio sample the relationship between Part 
A scores obtained by giving one point for each question answered 
entirely correctly and a Part A score showing the number of cor- 
rect answers marked. 

The scores of neither distribution showed marked skewness, but
those for questions right have a slight positive skewness. The corre-
lation between the two sets of scores (on the same answers to the
same questions) is 0.67. Of the 26 (24%) who answered ten or more
questions correctly, only 15 were also in the top 26 scores for answers
correct.
TABLE 2
Relationship of Number of Questions Correctly Marked
and Number of Answers Correctly Marked
(Part A)
Number of Right Answers 
20-24 25-29 30-34 35-39 40-44 45-49 50-54 55-59 60-64 65-69 70-74 Fr. 
Similar contingency tables for Part B and for total scores (not
shown here) show similar relationships. The correlations between 
questions right and answers right are 0.75 and 0.74 respectively for 
Part B score and for total score. 

This presents the question as to which scoring method is better. 
The test was designed to sample knowledge in various areas of sci- 
ence and to sample the contestants’ ability to think and reason in 
the vocabulary and concepts of science. 

The marked increase in positive skewness sought was not pro- 
duced. 

From another point of view the score for questions right seems to 
be superior. It requires more careful scrutiny and evaluation of all 
answers offered and also requires a thoroughness in answering not 
demanded by treating each answer as an independent True-False 
question.

On the basis of the contingency table (Table 2) one could believe 
that there are students who are good right answer markers who are 
not so competent on getting questions right, but relatively fewer of 
the converse category. 
COGNITIVE STYLES IN THE SCHEMATIZING PROCESS:
A CRITICAL EVALUATION¹

DOUGLAS N. JACKSON²
University of Western Ontario 


In psychophysical judgment, as in interpersonal situations, in- 
dividuals differ in the degree to which they notice small differences 
and changes in sequences of stimuli. Some Ss, termed “levelers,”
seem to be relatively insensitive to differences between similar stim- 
uli, while others, termed “sharpeners,” tend to emphasize these dif- 
ferences. Each approach probably has both functional and dysfunc- 
tional aspects. Thus, the sharpener might have his counterpart in 
lay personality theory in either the incisive critic or the individual 
who ruins his eyes splitting hairs. In contrast, the leveler might be 
the eclectic theoretician quick to see the consistencies in diverse 
positions or, on the other hand, the muddle-headed bungler who 
constantly confuses distinct alternatives. 

The present study is a systematic attempt to define, to measure, 
and to examine personality correlates of certain consistent individ- 
ual differences in the areas of perception and cognition previously 
identified as leveling-sharpening. In this attempt we have paid par-
ticular attention to contemporary approaches to construct valida-

1 This study is one of a series supported by a USPHS Research Fellowship
to Odin C. Vick and by Research Grants from the National Institute of Men-
tal Health and the Ontario Mental Health Foundation. Katherine D. Baker
provided valuable editorial assistance, and Samuel Messick provided materials
and helpful advice.
2 This study was written when Odin C. Vick was at Educational Testing
Service and Douglas N. Jackson was at the Institute for the Study of Human
Problems, Stanford University, as a PHS Special Research Fellow.

tion (Cronbach and Meehl, 1955; Loevinger, 1957), as well as to
certain elementary concepts of classical test theory. The problem of 
defining individual consistencies in cognition as an approach to the 
study of personality is both a challenging and a hazardous under-
taking. Our understanding of the manner in which personality
structure manifests itself in diverse perceptual and cognitive proc- 
esses has been greatly broadened by the pioneering efforts of such 
research workers as Thurstone (1944), Witkin (1950), and the 
Menninger group. However, here, as elsewhere in science it is impor- 
tant to avoid the dangers of allowing enthusiasm with initial re- 
sults to outstrip careful refinement of procedures, and of allowing 
hypothesis and theory elaboration to overwhelm the equally im- 
portant challenge of empirical validation. While there have been 
notable exceptions to this tendency (Witkin, Lewis, Hertzman, 
Machover, Meissner, and Wapner, 1954; Witkin, Dyk, Faterson, 
Goodenough, and Karp, 1962), this research area, even more than 
most, requires the most painstaking, even tedious, analysis of the 
properties of measures, if its initial promise is to be repaid with 
hard currency. 

The terms “leveling” and “sharpening” were first described by 
Wulf (1922; cf. Koffka, 1935) as referring to different ways in 
which a memory trace might change over time. Leveling involved a 
loss of detail, a tendency for certain prominent aspects of the figure 
to become less salient or to disappear; conversely, sharpening rep- 
resented modification of the original percept in the direction of 
emphasizing and elaborating details. Allport and Postman (1958) 
used these terms to describe similar systematic changes occurring in 
the transmission of rumor. 

In a series of studies conducted at the Menninger Foundation, 
Holzman and Klein (1951), Holzman (1954), and Gardner, Holz-
man, Klein, Linton, and Spence (1959) sought to define the level-
ing-sharpening dimension. Poles of this dimension were defined in
terms of two opposite hypothetical modes of perceptual and mem- 
ory organization, having relevance to consistent individual differ- 
ences in cognition and personality. The experimental definition of 
leveling-sharpening was based upon performance in the Schema- 
tizing Test. This test, an adaptation of the Squares Test used for 
other purposes by Hollingworth (1913), requires Ss to judge the ab-
solute size of squares projected on a screen one at a time. Squares of
14 different sizes are arranged in subsets of five, the five smallest
squares forming the first subset. Each succeeding set of five squares 
is formed by eliminating the smallest square of the preceding set 
and adding the next larger square that has not yet appeared in the 
test. The squares of each subset are presented to the Ss first in order 
of ascending size and then in two fixed, predetermined random or- 
ders. The squares are presented in a series with no extra time break 
between the subsets. Thus, as the test proceeds, there is a gradual 
but irregular increase in the size of squares. 
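
The construction of the stimulus series described above can be sketched in a few lines. The squares are coded here by rank (1 = smallest), and a seeded shuffle stands in for the two fixed, predetermined random orders, whose actual sequences are not given; both are assumptions, as is the function name.

```python
import random


def schematizing_sequence(n_sizes=14, subset_size=5, seed=0):
    """Build the full presentation order from overlapping subsets of squares."""
    rng = random.Random(seed)  # stand-in for the fixed, predetermined orders
    sizes = list(range(1, n_sizes + 1))
    sequence = []
    for start in range(n_sizes - subset_size + 1):
        subset = sizes[start:start + subset_size]
        sequence.extend(subset)  # first in order of ascending size
        for _ in range(2):       # then in two random orders
            shuffled = subset[:]
            rng.shuffle(shuffled)
            sequence.extend(shuffled)
    return sequence


sequence = schematizing_sequence()
print(len(sequence))  # 10 subsets, each shown three times, five squares each
```

With 14 sizes and subsets of five there are ten overlapping subsets, so this reading of the procedure gives 150 judgements in all.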

Two styles of responding to the test, presumably a bipolar di- 
mension, have been identified: leveling, the general and progres- 
sive failure to keep up with the trend of increasing size of the 
squares; and sharpening, the absence or minimization of that lag or 
the anticipation of the trend of change. Leveling and sharpening 
tendencies are thought to be revealed in another respect, namely 
“ranking accuracy.” When the ranks of the absolute size judge- 
ments of each S within each of the subsets are compared with the 
ranks of the size of the stimuli of each set, “ranking accuracy” 
scores may be obtained. These scores have served as indicants for 
assigning Ss to sharpening or leveling categories (Holzman, 1954; 
Holzman and Klein, 1954). 

The problems in leveling-sharpening begin with the Schematizing 
Test, which is tedious for Ss and expensive to administer and score, 
thus discouraging some investigators from undertaking research in 
the area at all, while others study samples of inadequate size. The 
notable lack of system or clear rationale in scoring has given rise to 
numerous problems, several of which have been carefully expli- 
cated by Krathwohl and Cronbach (1956). Neither the appropriate 
measurement operations nor the hypothesized underlying processes 
have been unequivocally spelled out. Scores based both on varia- 
tions in lag and on ranking accuracy have been used as criteria for 
selecting groups of levelers and sharpeners, often without specifying 
which type of score was used or how multiple scores were combined 
(Holzman, 1954; Holzman and Gardner, 1959; Holzman and Klein, 
1954). This fact renders the research findings very ambiguous, since 
such correlations as have been reported between lag and ranking 
accuracy have not been sufficiently high to justify interchanging the 
scores (cf. Gardner, Jackson, and Messick, 1960). Indeed, the un- 
availability of experimentally independent scores of accuracy and 
lag makes any obtained correlations and factor patterns difficult to
interpret. 

Both as a cause and as a result of equivocal research findings, the 
conceptualization of leveling-sharpening with regard to basic proc- 
esses and perceptual, cognitive, and personality correlates has at 
times lacked clarity and denotative specificity. The principal proc- 
ess hypothesis, which ascribes leveling to perceptual assimilation 
and sharpening either to less assimilation or to contrast tendencies 
(Holzman and Klein, 1950; Klein, 1951), has received some support
as regards time error assimilation effects (Gardner et al., 1959; 
Holzman, 1954; Holzman and Klein, 1951, 1954), but in other stud- 
ies the assimilation hypothesis has not received unequivocal sup- 
port (Gardner et al., 1959; Holzman, 1954; Murney, 1955). 

Some evidence has been found for the importance of leveling- 
sharpening both in immediate memory organization (Gardner and 
Lohrenz, 1959), and delayed memory organization (Holzman and
Gardner, 1960). In addition, suggestive evidence of a relation to re-
pression (Gardner et al., 1959; Holzman and Gardner, 1959; Lach-
mann, Lapkin, and Handelman, 1962) has been found. Holzman
and Klein (1950) and Klein (1951) reported that levelers were 
judged by psychiatrists to be passive, constricted, rigid, avoidant of 
competition, and naive, as compared to sharpeners. In addition, 
they have suggested (Holzman, 1954; Klein, 1951) that sharpeners 
may show greater cognitive complexity than levelers. Tear and 
Guthrie (1955) found levelers nominated more often as cooperative, 
and sharpeners as competitive, by their fraternity brothers. Krath-
wohl and Cronbach (1956) found that sharpeners tended toward greater
rigidity in questionnaire responses, but Murney (1955) obtained no 
significant differences between levelers and sharpeners on Q sorts of 
self-descriptive items. Although Holzman and Klein (1950) re- 
ported a relationship between sharpening and embedded-figures test 
performance, subsequent studies have failed to support this finding 
(Gardner, Jackson, and Messick, 1960; Murney, 1955); indeed,
such tests define a quite different perceptual and personality di- 
mension, field independence (Jackson, Messick, and Myers, 1964; 
Witkin et al., 1954; Witkin et al., 1962). 

In approaching the problems of defining cognitive styles in the 
schematizing process, we considered it essential to use more than 
one score to measure components of this hypothetical dimension (cf. 
Campbell and Fiske, 1959). Therefore, different simplified group-
administered test forms of the schematizing process were designed 
(cf. Vick and Jackson, 1964), permitting both an evaluation of in- 
ternal consistency and cross-method correlation of components. 
The validation process included an appraisal of the correlations of 
these leveling-sharpening component scores with measures of: (a) 
perceptual-cognitive contrast tendencies, i.e., criticalness (Frederik- 
sen and Messick, 1959), perceptual criticalness (Vick and Jackson, 
1964), and category width (Pettigrew, 1958); (b) art judgement 
complexity; and (c) selected personality traits, as measured by
trait ratings. To assess discriminant validity, correlations with cer- 
tain mental abilities, field independence, age, sex, and academic per- 
formance were obtained. Finally, the data were examined for 
information as to the consistency and generality of leveling- 
sharpening tendencies.



Method 


Subjects 


Subjects were volunteers recruited from introductory psychology 
classes. Of 204 tested, eight had at least one unscorable test and 
were not included in the analysis. The final sample consisted of 196 
Ss, 109 men and 87 women. The age range was 17 to 26 years, with 
a mean of 19.0 and a standard deviation of 1.3 years. 


Procedure 


Each S served in one of four identical three-hour sessions, during 
which 12 tests were administered. The tests used are described be- 
low together with the variables they measured and reasons for their 
inclusion. 

The Squares Lag Test and Dots Lag Test were designed for this 
study on the basis of previous results (Vick and Jackson, 1964), the 
former in an effort to represent previously used methods of measur- 
ing leveling-sharpening, and the latter in order to assess the gener- 
ality of the dimension. Both tests were analogous to the Holzman- 
Klein Schematizing test, but shorter and simpler, requiring Ss to 
make 68 absolute judgements (size of squares or number of dots) of 
20 stimuli. In each test the stimuli were projected on a motion pic- 
ture screen singly and successively, 3 sec. on and 6 sec. off. Each 
test commenced with the smallest four stimuli, shown once in ran-
dom order. In each test succeeding sets were formed by dropping
the smallest (or least numerous) stimulus from the preceding set 
and incorporating the stimulus next larger (or more numerous) than 
any previously presented. Thus, 17 sets of four stimuli were formed 
for each test. Adjacent stimuli in the basic squares series were dis-
criminably different by paired comparisons; adjacent dots stimuli
were not. Instructions stressed accuracy and alertness. In each test 
the first four slides were for demonstration purposes; they were 
shown in a separate random order forward and backward, while E 
labeled each stimulus as to size of squares or number of dots. Judge-
ments were made by circling fixed categories on a prepared answer
sheet, a modified successive intervals procedure. Three extra cate- 
gories below and 11 above the actual limits of the stimulus series 
were provided. 
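
The counts given for the lag tests (20 stimuli, 17 sets of four, 68 judgements) follow from the sliding construction, as the sketch below confirms. Stimuli are coded by rank and the function name is hypothetical; the random order of presentation within each set is not modeled.

```python
def lag_test_sets(n_stimuli=20, set_size=4):
    """Overlapping sets: each drops the smallest stimulus and adds the next larger."""
    return [list(range(start, start + set_size))
            for start in range(1, n_stimuli - set_size + 2)]


sets = lag_test_sets()
print(len(sets))                  # number of sets
print(sum(len(s) for s in sets))  # number of judgements required
print(sets[0], sets[-1])
```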

Four scores were obtained which reflected the lag-anticipation 
tendencies described by Holzman and Klein (1951); these were 
termed Early Slope, Late Slope, Over-all Slope, and Increment.
The first three represented the slope of the regression of judgement 
totals on stimulus totals for the first 8, last 8, and all 17 sets of 
stimuli, respectively. Increment was a simplified version of Incre- 
ment Error (Gardner et al., 1959) in which certain mathematically 
constant values were eliminated. The four scores will be referred to 
as lag scores, although scored in the direction of anticipation. 
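
The three slope scores can be sketched as ordinary least-squares slopes of judgement totals on stimulus totals. The coding of stimuli by rank, the invented judgement totals, and the function names are assumptions made for illustration; Increment is omitted because its formula is not given here.

```python
def slope(xs, ys):
    """Least-squares slope of ys regressed on xs."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return sxy / sxx


def lag_scores(stimulus_totals, judgement_totals):
    """Early, Late, and Over-all Slope from the per-set totals of 17 sets."""
    return {
        "early": slope(stimulus_totals[:8], judgement_totals[:8]),
        "late": slope(stimulus_totals[-8:], judgement_totals[-8:]),
        "overall": slope(stimulus_totals, judgement_totals),
    }


# Stimuli coded 1..20 by rank, so set k (k = 1..17) has a stimulus total of 4k + 6.
stimulus_totals = [4 * k + 6 for k in range(1, 18)]
# Invented S who keeps up with the trend at first and lags behind later.
judgement_totals = [t if t <= 38 else 38 + 0.5 * (t - 38) for t in stimulus_totals]
scores = lag_scores(stimulus_totals, judgement_totals)
print(scores)
```

A slope near 1 indicates judgements that keep pace with the increasing stimuli; lower slopes indicate lag.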

Two additional scores representing ranking accuracy and ranking 
error were obtained. The Accuracy Score (the former Ranking Ac- 
curacy score) was the number of agreements between judgement 
ranks and stimulus ranks within sets of stimuli. The Error score 
was refined somewhat to take account of the magnitude, as well as 
the frequency of ranking errors. Error was the sum of the absolute 
values of discrepancies between intraset stimulus and judgement 
ranks.
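
The Accuracy and Error scores defined above can be sketched as follows. The treatment of tied judgements (average ranks) is an assumption, since the handling of ties is not described, and the function names and data are hypothetical.

```python
def ranks(values):
    """Ranks with 1 for the smallest value; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of the tied positions, 1-based
        for k in range(i, j + 1):
            result[order[k]] = shared
        i = j + 1
    return result


def accuracy_and_error(stimulus_sets, judgement_sets):
    """Accuracy: count of rank agreements. Error: sum of absolute rank discrepancies."""
    accuracy = 0
    error = 0.0
    for stimuli, judgements in zip(stimulus_sets, judgement_sets):
        for rs, rj in zip(ranks(stimuli), ranks(judgements)):
            if rs == rj:
                accuracy += 1
            error += abs(rs - rj)
    return accuracy, error


# Two invented sets of four stimuli with one S's category judgements.
stimulus_sets = [[1, 2, 3, 4], [2, 3, 4, 5]]
judgement_sets = [[3, 4, 6, 7], [5, 4, 6, 8]]
print(accuracy_and_error(stimulus_sets, judgement_sets))
```

With 17 sets of four stimuli the Accuracy score under this reading can range from 0 to 68.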

The Alternative Expressions Test was used to measure critical- 
ness response set (Frederiksen and Messick, 1959), on the hypothe- 
sis that criticalness and leveling are negatively related. It requires 
Ss to judge whether the meaning of each of 70 sentences would be 
changed in any practical way by substituting a given word or 
phrase for one indicated in the sentence. The response set score D, 
number of judgements “different,” was used. 
The Angles and Squares Similarity-Difference Tests were de-
signed to test the generality of criticalness over simple perceptual
judgements and were tentatively labeled tests of “perceptual critical- 
ness.” Each stimulus for the Angles Test was a slide showing two 
straight black lines, joined at one end to form an acute angle. All 
angles were oriented the same way, with the base line horizontal. 
Squares stimuli were the same as for the Squares Lag Test. Each
test comprised 20 pairs of stimuli, presented on a screen singly and 
successively for 3 sec. each, with intervals of 1.5 sec. between pair 
members and 6 sec. between pairs. Ss were to judge whether the 
stimuli in each pair were the same or different. The members of half
the pairs in each test were discriminably different by paired com- 
parisons; members of the remaining pairs in each test were identi- 
cal. Scores were obtained separately for the Angles and the Squares 
Similarity-Difference Tests and combined for a total; they were 
the number of correct judgements (R+), and the number of judge-
ments of “different” (D).
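
The two Similarity-Difference scores can be sketched as simple counts. The function name and the boolean coding (True = different) are hypothetical conveniences, and the ten pairs are invented.

```python
def similarity_difference_scores(pair_is_different, judged_different):
    """Return (R+, D): number of correct judgements and number judged different."""
    r_plus = sum(1 for actual, judged in zip(pair_is_different, judged_different)
                 if actual == judged)
    d = sum(1 for judged in judged_different if judged)
    return r_plus, d


# Invented example: half the pairs differ, and S judges most pairs different.
actual = [True, False, True, False, True, False, True, False, True, False]
judged = [True, True, True, False, True, True, False, True, True, False]
print(similarity_difference_scores(actual, judged))
```

Note that D depends only on how often S says "different," which is why it can serve as a response set score independent of correctness.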

The Estimation Questionnaire (Pettigrew, 1958), a measure of 
category width, requires Ss to estimate the upper and lower ex- 
tremes of the frequencies when given the average frequency of 
some event. On the hypothesis that Ss who tend to exaggerate differ- 
ences between stimuli would use narrower categories than those 
who tend to minimize differences, this test was expected to correlate 
with sharpening and with criticalness. Scoring followed Pettigrew's 
(1958) procedure. 

The Figure Choices Test was used to appraise the hypothesis that 
habitual leveling results in a generally impoverished, simplified 
cognitive structure. Barron (1953, 1955) used the Barron-Welsh Art 
Scale (Barron and Welsh, 1952) as a measure of preference for 
complexity or simplicity. A forced-choice form of the test was de- 
veloped by Sechrest and Jackson (1961) to reduce response set 
variance. Figure Choices is a forced-choice booklet form of the Bar-
ron-Welsh scale prepared by Messick and Kogan.³ It is scored for
the number of preferences S expresses for the more complex member 
of pairs of figures. 

A group-administered Embedded Figures Test, a measure of field 
independence (Witkin et al., 1954), was included, in the light of the 


3 A report of the larger study for which Figure Choices was designed is in
preparation. 
status of field independence as a separate cognitive style, to provide
data relevant to the discriminant validity of the lag tests as meas- 
ures of leveling-sharpening. The embedded figures form was de- 
signed by Jackson, Messick, and Myers (1964) to minimize the role 
of memory organization in performance and to provide for conven- 
ient group administration. The test was administered with a 12-min. 
time limit and scored for the number of simple figures correctly 
identified in the test figures. 

The Trait Rating Form was used to test hypotheses concerning 
relationships between leveling-sharpening and certain personality 
variables. This instrument, developed by Jackson, is designed to 
assess personality through Ss' judgements of desirability of 300 
traits which were chosen to represent 11 of Murray’s need concepts, 
as well as desirability response bias, with 25 items in each scale. 
Ratings of traits were on a nine-point scale. Scores were based on 
the sum of the ratings. 

Advanced Vocabulary, Mathematics Aptitude, and Perceptual 
Speed Tests were included as a further check on the discriminant 
validity of the Squares Lag Test. The Advanced Vocabulary and 
Mathematics Aptitude Tests were taken from French’s (1954) kit 
of factored tests; they are classified as measures of verbal knowl- 
edge and general reasoning, respectively. The Perceptual Speed 
Test is from the Guilford-Zimmerman Aptitude Survey (1947). The 
ability test scores were the number of correct responses. 

Age, Sex, and academic Grade Point Average were obtained from 
each S so that the correlation of these variables with leveling- 
sharpening might be appraised. 


Results and Discussion 


The means and SDs of leveling-sharpening measures are pre- 
sented in Table 1. Note that the lag scores and Accuracy are higher, 
and Error lower in the Squares Lag Test than in the Dots Lag Test. 
These differences reflect the fact that the smallest difference be- 
tween squares was easily discriminable, whereas this was not the 
case for the dots. 

Table 2 shows the corrected odd-even reliability coefficients of the 
measures. Of special interest are the estimates of internal consist-
ency for the Squares and Dots Lag Test scores. The score “X” rep-
resents the sum of judgements within each set of four stimuli in the
TABLE 1
Means and Standard Deviations of Leveling-Sharpening Scores
(N = 196)

                     Squares Lag Test      Dots Lag Test
Variable              Mean      SD         Mean      SD
Early Slope            .85      .68         .54      .57
Late Slope             .75      .75         .12      .49
Over-all Slope         .76      .65         .35      .42
Increment             2.18      .82        1.10      .10
Accuracy             38.18     7.64       16.08     5.42
Error                26.90     8.67       66.84    10.59

TABLE 2
Corrected Odd-Even Reliabilities of Test Score Variables
(N = 196)

Variable                         r      Variable                      r
Squares Lag Test                        Estimation Questionnaire     .83
  X (a)                         .99     Embedded Figures             .79
  Accuracy                      .56     Figure Choices               .91
  Error                         .56     Mathematics Aptitude         .78
Dots Lag Test                           Advanced Vocabulary          .70
  X (a)                         .98     Trait Rating Form
  Accuracy                      .33       n Achievement              .91
  Error                         .38       n Impulsion                4
Similarity-Difference Tests (b)           n Dominance                .86
  Angles R+                     .05       n Understanding            .82
  Angles D                      .87       n Autonomy                 .83
  Squares R+                   -.05       n Aggression               .88
  Squares D                     .23       n Abasement                .88
  Total R+                     -.07       n Rejection                .84
  Total D                       .46       n Harmavoidance            .79
Alternative Expressions         .81       n Exhibition               .81
                                          n Affiliation              .90

Note. Correlations required for significance at .05 and .01 levels: .14 and .19 respectively.
(a) X is sum of judgements within sets.
(b) R+ is number correct; D is number judged different.


lag tests. Thus the reliabilities of X for the two lag tests represent 
the internal consistencies of the scores upon which Increment and 
Over-all Slope are based, and may be taken as estimates of the re-
liabilities of Increment and Over-all Slope. The uncorrected odd-
even reliability coefficients for X in the two tests may be considered
to be estimates of the reliability of Early Slope and Late Slope. The 
correlations indicate that trends in judgements are very consistent
from one set of stimuli to the next. 
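
The relation between uncorrected and corrected odd-even coefficients can be sketched as follows. This is a minimal illustration that assumes the correction applied was the Spearman-Brown formula for a test of doubled length; the text does not name the formula, and the function names are hypothetical.

```python
def spearman_brown(r_half: float) -> float:
    """Step a half-test (odd-even) correlation up to full-test length."""
    return 2.0 * r_half / (1.0 + r_half)


def uncorrected_from_corrected(r_full: float) -> float:
    """Invert the Spearman-Brown formula to recover the half-test correlation."""
    return r_full / (2.0 - r_full)


# Corrected reliabilities of X reported in Table 2 for the two lag tests.
for label, r_full in (("Squares", 0.99), ("Dots", 0.98)):
    r_half = uncorrected_from_corrected(r_full)
    print(f"{label}: corrected {r_full:.2f}, implied uncorrected {r_half:.3f}")
```

Under that assumption the implied uncorrected coefficients stay within a few hundredths of the corrected ones, because the correction changes very little when the half-test correlation is already close to 1.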

Although less consistent than the lag scores, both Accuracy and 
Error show reliabilities in the moderate range, significant well be- 
yond the .01 level. Both Accuracy and Error are somewhat less re- 
liable in the Dots than in the Squares Lag Test. This difference 
probably resulted from the difference in discriminability of stimuli. 

These data for the lag tests make it apparent that all the lag 
measures possess satisfactory internal consistency, and that the 
Accuracy and Error scores, though less reliable, are sufficiently 
high so as to be of further interest. The data support the conclusion 
that on both tasks, Ss show the individual differences in judgements 
that have been labeled leveling and sharpening by the Menninger 
investigators. 

In the Similarity-Difference Test the reliabilities of R+ (total
number of correct judgements) were not significantly different from 
zero, while those of D (number of judgements “different”) were all 
greater than zero at better than the .01 level of significance, though 
low to moderate in magnitude. Thus, in Similarity-Difference Test 
scores, simple accuracy of judgement was not consistent for a given 
S from one item to the next, despite the composition of the tests 
which insured the discriminability of the members of the pairs 
keyed “different.” The data indicate that this type of test evokes a 
set to respond “different” which is more consistent than the tend- 
ency to judge differences accurately. The various scales of the Trait 
Rating Form generally displayed substantial internal consistency. 
The reliabilities of seven of the scales were in the .80’s, with three 
lower and two higher, the median being .83. These are moderate to 
high reliabilities for personality scales. 


Intercorrelations of Variables 


Pearson product-moment correlations were obtained between all 
relevant pairs of variables. 
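
The significance thresholds quoted in the table notes can be approximated from the sample size alone. The sketch below uses the Fisher z transformation rather than an exact t-based value, and the function name is hypothetical; how the authors obtained their critical values is not stated.

```python
from math import sqrt, tanh
from statistics import NormalDist


def critical_r(n: int, alpha: float) -> float:
    """Approximate two-tailed critical Pearson r via the Fisher z transformation."""
    z_crit = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    return tanh(z_crit / sqrt(n - 3))


N = 196  # sample size reported in the tables
r_05 = critical_r(N, 0.05)
r_01 = critical_r(N, 0.01)
print(f".05 level: {r_05:.3f}   .01 level: {r_01:.3f}")
```

With N = 196 the approximation comes out near .14 and .18, close to the .14 and .19 given in the table notes, the latter being slightly more conservative.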

Lag Scores. Lag or anticipation (Slope and Increment scores) ap- 
pears to be a fairly consistent set of individual differences within 
each of the lag tests, but is more complex than was revealed in pre-
vious research (e.g., Gardner et al., 1959). Intercorrelations of lag
scores were generally high (Tables 3 and 4), but not so high as to
suggest their interchangeability. In both tests, the lowest correla-
tions among lag measures were between Early Slope and Late Slope.
The latter results, although not reproducing exactly the findings of 


TABLE 3
Intercorrelations of Squares Lag Test Scores
(N = 196)

Score            Early Slope  Late Slope  Over-all Slope  Increment  Accuracy

Late Slope           .41
Over-all Slope       .62         .87
Increment            .77         .76          .92
Accuracy             .18         .16          .20           .22
Error               -.06        -.05         -.06          -.06       -.84

Note. Correlations required for significance at the .05 and .01 levels: .14 and .19 respectively.


TABLE 4
Intercorrelations of Dots Lag Test Scores
(N = 196)

Score            Early Slope  Late Slope  Over-all Slope  Increment  Accuracy

Late Slope           .17
Over-all Slope       .67         .62
Increment            .72         .39          .77
Accuracy             .14        -.03          .04           .14
Error                .04         .00          .05           .07       -.57

Note. Correlations required for significance at the .05 and .01 levels: .14 and .19 respectively.


Krathwohl and Cronbach (1956), because the present correlations
were statistically significant, do lend support to the conclusion of
those investigators that Early Slope and Late Slope contain differ-
ent information. All the lag scores will probably prove useful in
further analytical work on leveling-sharpening, though one may 
question the unique value of a separate Increment score because of 
its high correlation with Over-all Slope. 
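
The redundancy of Increment with Over-all Slope can be gauged by squaring the reported correlations. This is a minimal sketch that treats the squared correlation as an index of shared variance, an interpretive convention rather than a computation reported by the authors.

```python
# Increment vs. Over-all Slope correlations reported in Tables 3 and 4.
correlations = {"Squares Lag Test": 0.92, "Dots Lag Test": 0.77}

# Squared correlation: proportion of variance the two scores have in common.
shared_variance = {test: r ** 2 for test, r in correlations.items()}

for test, r2 in shared_variance.items():
    print(f"{test}: r = {correlations[test]:.2f}, shared variance = {r2:.0%}")
```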

The hypothesis that lag is an adaptation-level or frame of refer- 
ence phenomenon receives some support from the correlations of 
Squares Lag Test scores with their counterparts in the Dots Lag 
Test. Table 5 shows that lag scores generally held up fairly well 
across the two tests, considering the rather marked differences be- 
tween the tests in type and discriminability of stimuli and type of 
judgement required, as well as the probable operation of sequence 
effects. Of considerable interest is the exception to this rule—Dots 
Late Slope—which was correlated about zero with all Squares Lag 

TABLE 5
Correlations Across Lag Tests

                                      Squares Test
                   Early    Late    Over-all
Dots Test          Slope    Slope   Slope     Increment  Accuracy  Error

Early Slope         .43      .42     .48        .49        .16     -.04
Late Slope         -.06      .12     .11        .06        .06     -.05
Over-all Slope      .29      .38     .42        .41        .06      .02
Increment           .31      .34     .38        .38        .06      .01
Accuracy            .18      .08     .08        .11        .12     -.09
Error              -.02      .20  [illegible]   .10       -.06      .11

Note. Correlations required for significance at the .05 and .01 levels: .14 and .19 respectively.


Test scores. In contrast, Dots Early Slope was correlated about as 
well with all Squares Lag Test lag scores as each of the latter cor- 
related with its own Dots Lag Test counterpart. These results sug- 
gest strongly that a mode of responding adopted in the Squares Lag 
Test carried over well into the Dots Lag Test for most Ss, and was
then abandoned by them. Only further research can clarify the
reasons for these results, but the data suggest that the phenomena 
involved may be individual differences in anchoring effects in judge- 
ments (Sherif and Hovland, 1961). 

Lag and Accuracy. Apparently lag is only slightly related to 
ranking accuracy, although these two aspects of performance have 
been used rather indiscriminately as referents of “leveling- 
sharpening.” The data (Tables 3 and 4) did not permit the rejection 
of the null hypothesis as regards a relationship between lag vs. Er- 
ror, but did permit its rejection for lag scores vs. Accuracy in the 
Squares Lag Test and for Dots Early Slope and Increment. The im- 
portance of this finding derives from the fact that the assumption, 
made by some previous investigators, that the two types of scores 
are measuring essentially the same trait of leveling-sharpening was 
one basis for conceptualizing leveling-sharpening in terms of assim- 
ilation vs. contrast. That is, lag was regarded as a consequence of 
the operation of assimilation tendencies in sequential judgements 
(Holzman, 1954). The present correlations, however, do not support 
the equivalence of lag and accuracy scores. Moreover, it appears 
that the measures of accuracy are more susceptible than the lag 
scores to change in situational or task variables. Neither Accuracy 
nor Error in the Squares Lag Test correlated significantly with its
analogous score in the Dots Lag Test (Table 5). 

Further data must be examined before much can be said about 
the nature of the accuracy variable itself. However, the data in 
Tables 3, 4, and 5 are sufficient to cast serious doubt on the func- 
tional unity of leveling-sharpening as previously defined. The lat- 
ter statement holds whether one is referring to defining operations 
or to the hypotheses relevant to this cognitive style. It is at least 
clear that a more careful analysis of the leveling-sharpening for- 
mulation is needed and probable that it will have to be drastically 
redefined. Research aimed at further clarification of the concept will 
need to use the data from all Ss, not merely that from extreme 
groups (Holzman, 1954). The practice of using extreme groups may 
have produced statistically significant differences in the predicted 
directions which lack both practical and theoretical importance be- 
cause of low relationships between the variables when all the data 
are considered (cf. Bolles and Messick, 1958). 

Lag vs. Other Variables. Discussion of the correlates of leveling- 
sharpening will be confined to relationships involving scores on the 
Squares Lag Test. The latter shows a higher and clearer patterning 
of correlations than does the Dots Lag Test, a finding probably at- 
tributable to the higher reliability of the Squares Lag Test. 

Little evidence was obtained that lag is related to age, sex, or 
academic achievement or ability variables. None of the correlations 
between lag scores and age, grade point average, or the Mathe- 
matics Aptitude Test was significant, which offers some evidence for 
the discriminant validity of lag scores. On the other hand, Ad- 
vanced Vocabulary was correlated significantly at the .05 level 
with Early Slope and nearly so with Increment; Perceptual Speed 
was correlated with Over-all Slope and Increment at the .05 level; 
and sex was correlated with Early Slope at the .01 level and with 
Increment at the .05 level of significance. These correlations may 
best be regarded as leads for further research rather than as a basis 
for firm conclusions. 

The assimilation vs. contrast hypothesis received little support in 
this study as an explanation of lag. Lag scores were not signifi- 
cantly correlated with Similarity-Difference Test or Alternative 
Expressions Test D scores, or with the Estimation Questionnaire. 
In fact, no evidence was obtained for a general trait of perceptual 
contrast. The Similarity-Difference tests were designed for the
study and the Alternative Expressions Test and Estimation Ques- 
tionnaire were included in the study because performance on those 
tests could be conceptualized readily in terms of a style of percep- 
tual contrast. However, although the Similarity-Difference Test D 
scores were correlated significantly with each other at the .01 level (.32), lending
some support to the contrast trait, neither of those scores was corre- 
lated significantly with the Alternative Expressions Test D (a 
measure of criticalness) or with the Estimation Questionnaire 
(category width), nor were the latter two tests significantly corre- 
lated. 

Only one of the four correlations between lag scores and Figure 
Choices (Early Slope) was significant at the .05 level so that, while 
a suggestion of a relationship worth following up in future research
exists, caution is called for. Evidence linking lag and ability to ex- 
tract items from context as in the Embedded Figures Test was not 
obtained. This is in line with most previous findings and supports 
the discriminant validity of the lag scores. Furthermore, the pres- 
ent data offer little support for the hypothesis that lag is related to 
personality variables as measured by the Trait Rating Form. 

Accuracy and Error Scores. As noted above, the safest conclusion 
to be made from these and previously published results is that 
ranking accuracy and lag are different variables, although some- 
what correlated. The next question to be answered is: what is the 
nature of the process (or processes) reflected in the Accuracy and 
Error scores? The data in Tables 3 and 4 make clear the fact that 
Accuracy and Error, though not identical, measure something very 
similar. They were correlated —.84 in the Squares Lag Test and 
—.57 in the Dots Lag Test. Neither score showed statistically sig- 
nificant stability over tests (Table 5). A revision of the Dots Lag 
Test, with the stimulus series arranged more like that of the Squares 
Lag Test might increase the stability of these scores across the tests 
to an acceptable level. 

Accuracy vs. Other Variables. Accuracy appears to be related to 
general reasoning (Mathematics Aptitude Test) and possibly to
perceptual speed and age, but not to verbal ability, sex, or academic 
performance. Table 6 shows that the Mathematics Aptitude Test 
was significantly correlated with both Accuracy and Error, posi- 
tively with the former and negatively with the latter. French (1954) 


VICK AND JACKSON 281 


TABLE 6
Correlations of Accuracy and Error Scores with Ability Tests,
Age, Sex, Academic Achievement, and Cognitive Variables
(N = 196)

Variable                  Accuracy    Error    Variable                         Accuracy    Error

Mathematics Aptitude         .20       -.21    Similarity-Difference Tests (a)
Advanced Vocabulary          .03        .05      Angles R+                        -.03       .05
Perceptual Speed         [illegible]   -.11      Angles D                     [illegible]   -.02
Age                      [illegible]   -.13      Squares R+                        .15      -.19
Sex                         -.01        .05      Squares D                         .20      -.12
Grade Point Average          .05       -.07      Total R+                          .10      -.15
Alternative Expressions      .12       -.12      Total D                           .18      -.08
Embedded Figures             .16       -.12    Estimation Questionnaire            .10      -.07
                                               Figure Choices                     -.01       .09

Note. Accuracy and Error are from Squares Lag Test. Correlations required for significance
at the .05 and .01 levels: .14 and .19 respectively.
(a) R+ is judgements correct; D is judgements "different."


classified this test as one of general reasoning, but one is reluctant 
to interpret the present correlations as indicating that accuracy 
scores would load on that factor; more probably perceptual ac-
curacy, or carefulness in performing tasks, contributes to both the
Mathematics Aptitude and accuracy scores. It is tempting to in- 
terpret the near-significant correlations of age and perceptual speed 
with the accuracy scores, but in the interest of caution it is prob- 
ably best to regard them as merely indicating a possible relation- 
ship worthy of further attention. 

The question was raised earlier as to whether ranking accuracy 
represents an aspect of accuracy of perceiving, or a set such as per- 
ceptual criticalness or contrast. Table 6 contains the data relevant 
to this question; none of the correlations is high enough to justify 
firm conclusions, but several are statistically significant. The most 
striking feature of these correlations is that Accuracy was corre- 
lated positively and Error negatively with all scores (except R+ on 
the Angles Test) of the Similarity-Difference Tests, both those 
designed to measure accuracy of perception and those reflecting 
consistent tendencies to judge stimuli to be different. This pattern 
included the Alternative Expressions Test D score, and the Em- 
bedded Figures Test as well. It seems likely, considering all these 
results together, that both keenness of discrimination and set to 
respond “different” to pairs of stimuli may contribute to Accuracy 
and Error, but do not account for a large proportion of the vari-
ance in ranking accuracy.

Accuracy was correlated with n Autonomy, and Error was corre- 
lated with n Autonomy and with n Impulsion, all significant at the 
.05 level, but none accounting for much variance. It is noteworthy
that the correlations with n Autonomy represent a reversal of the 
findings of Klein (1951). There were no other significant correla-
tions with personality variables, but in every case where there was
a directional prediction from Klein's results, the direction was re- 
versed in the present results. The correlations with n Autonomy 
and n Impulsion suggest again that accuracy represents acceptance 
of instructions and carefulness in performance. Considering the 
rather monotonous nature of the judgement tasks, which required 
prolonged and careful attention to detail, these significant correla- 
tions are not surprising. It would be of some interest to manipulate 
task variables such as adaptive consequences of perceptual ac- 
curacy and of task boredom as these affect relationships with per- 
sonality variables. 

One conclusion to be drawn from these data is that individual 
differences in lag or anticipation can be measured with high in- 
ternal consistency using briefer procedures than the Holzman- 
Klein Schematizing Test. The new methods also provide moderately 
reliable measures of the ranking accuracy variable. However, fur- 
ther refinement of the tests and simpler scoring procedures are still 
desirable. 

The major conclusion of this research is that “leveling- 
sharpening” is badly in need of further analytical research and 
reformulation. It is doubtful that leveling-sharpening refers to a 
unitary dimension in behavior. Rather, it appears to comprise phe- 
nomena which are operationally complex and to involve a set of 
hypotheses so loosely formulated and interrelated as to compound 
the usual difficulties of operational specification and appraisal of 
validity for describing or predicting behavior. 

Accuracy and lag tendencies do not share enough variance to be 
regarded as measures of the same dimension. Hence lumping these 
two measures together would appear to be illogical. Because only 
ranking accuracy, and not lag, showed any sign of correlating sig- 
nificantly with variables supposedly related to leveling-sharpening, 
the possibility exists that many previous findings reported in the 
literature were primarily attributable to accuracy and not to lag. 


In any event, some decision must be made concerning what is to be
the referent of “leveling-sharpening.” Further research will be re- 
quired to determine which of the components is the most useful 
referent, considering all the ways these terms have been used in the 
past.

Research is needed to obtain better measures of the processes in- 
volved in Accuracy, and to explore the nature of the relation be- 
tween Accuracy and Error. Research should also be aimed at ex- 
plicating the relationship between lag and anchoring or frame of 
reference effects, a question raised by the present and previous data. 
The similarities between these two groups of phenomena are com- 
pelling, and in fact the terms “assimilation” and “contrast” have 
been used to describe anchoring effects, although with different op- 
erational referents than those given these terms by the Menninger 
group (cf. Sherif and Hovland, 1961). 

The hypothesis that leveling-sharpening cognitive styles relate 
substantially to major personality variables is not unequivocally 
established. In the present study, the suggestion is that most of the 
previously reported relationships may possibly involve accuracy, 
but not lag, and some of them may not be in the direction predicted 
by Holzman and Klein. In any case, further studies of correlations 
between accuracy and/or lag and personality tendencies need to 
make use of multiple indicants of the personality variables in- 
volved, as well as experimental manipulation of such factors as the 
adaptive consequences of accuracy, the role of boredom, and other 
influences upon the social psychology of the assessee (cf. Riecken, 
1962). 

Although the results of this research are discouraging to the view 
either that leveling-sharpening is a unitary dimension or that it en- 
joys substantial correlations with personality variables, much more 
research needs to be done before efforts to understand it can be 
justifiably abandoned. The ubiquity of behavior like leveling and 
sharpening in studies of memory (Wulf, 1922; Allport, 1930), 
judgement (Sherif and Hovland, 1961) and rumor (Allport and 
Postman, 1958) suggests that with further operational clarification
some variable or variables like leveling and sharpening will prove 
to be of value in psychology. What is important is to demand that 
care in development of experimental measures keep pace with 
theory and to determine the nature of the basic processes involved 
before putting so much emphasis on the correlates of a putative 
variable which is itself so little understood. The concept “leveling-
sharpening,” like any other, must have some fairly clear referent if
interest in its relationship to other concepts is to be justified. 


Summary 


In a systematic attempt to define the leveling-sharpening con- 
struct, simplified group measures were developed. The Squares Lag 
Test and Dots Lag Test were designed to yield various scores of 
adaptation lag and ranking accuracy, the two major components of 
leveling-sharpening. These were administered to 196 Ss, together 
with tests of perceptual contrast, criticalness, category width, art-
judgement complexity, traditional personality need variables, as well
as with measures of field independence, intellectual abilities, age, 
sex, and academic achievement. 

Internal consistency for lag measures was high in both Squares 
and Dots Lag Tests; for accuracy measures it was moderate. Lag 
showed greater cross-method stability than did accuracy. Both tests 
yielded significant but quite low correlations between accuracy and 
lag. 

Accuracy scores showed some evidence of correlation with ex- 
ternal variables; lag scores generally did not. There was little to 
support the assimilation-contrast hypothesis. 

While lag and accuracy can each be measured reliably, they rep- 
resent largely different processes. It was concluded that leveling- 
sharpening does not refer to a unitary set of behavioral referents, 
and hence is in need of reformulation. It is suggested that lag may 
represent individual differences in anchoring effects, and accuracy 
some combination of perceptual contrast set, keenness or acuity of 
discrimination, and capacity for prolonged attention to task re- 
quirements and instructions. 
A “CONSTRUCT ITEM ANALYSIS”
OF SOME APTITUDE TESTS¹


RALPH HOEPFNER 
University of Southern California 


In item-analysis procedures, certain items of a test are retained 
and others discarded in order to improve the reliability or validity 
of the total scores of the test. The advantages of thereby producing 
a shorter and possibly more reliable and valid test have been ob-
tained generally in research concerned with predictive- and concur-
rent-validity problems. Tiffin and Hudson (1956) found that the
validity and reliability estimates of a test, item analyzed by two 
methods to reduce the number of items, were not meaningfully 
lowered, even when only 56 per cent of the items were retained. 
Webster (1956; 1957) reviewed procedures developed for item 
analyzing tests to maximize their internal-consistency reliabilities 
and total-score correlations with a criterion. He presented applica-
tions wherein his methods succeeded in increasing test validity and
homogeneity. 

When the immediate objective is to improve the construct valid- 
ity of a test, however, item-analysis techniques have been applied 
only indirectly. For example, it could be stated that factor-analyzed 
interest and temperament inventory scales have undergone item
analysis—those items hypothesized to measure the construct (trait) 
that do cohere upon a factor defined by other items hypothesized 
to measure the construct are incorporated into the scale for that 
trait. Items that are not loaded upon the factor in question, or items 
displaying complexity, are not incorporated into the scale. 

Several investigators have approached construct item analysis 
more directly. In his factor analysis of the items of the Progressive 
Matrices Test, a test sufficiently internally consistent to ensure the
dominance of a large general factor among the items, Vernon (1950) 
concluded that retaining items having the greatest loadings on the 
first centroid factor is hardly worthwhile. Henrysson's (1962) dem- 
onstration that point-biserial coefficients between items and total- 
test scores are identical with the first centroid factor loadings di- 
vided by the item standard deviations, leads to a similar 
conclusion regarding the value of item-total item analysis for “univ- 
ocal” tests of a construct. 
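
The identity attributed to Henrysson can be checked numerically. The sketch below assumes that the first centroid factor is extracted from the item covariance matrix with the item variances left in the diagonal; the response data are hypothetical and serve only to illustrate the algebra.

```python
from math import sqrt

# Hypothetical 0/1 item responses: rows are examinees, columns are items.
responses = [
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 1],
    [0, 1, 1, 1],
    [0, 0, 0, 0],
    [1, 1, 1, 1],
]
n_persons, n_items = len(responses), len(responses[0])
means = [sum(row[j] for row in responses) / n_persons for j in range(n_items)]

# Item covariance matrix (population form), with variances on the diagonal.
cov = [
    [
        sum((row[i] - means[i]) * (row[j] - means[j]) for row in responses) / n_persons
        for j in range(n_items)
    ]
    for i in range(n_items)
]

# First centroid loadings: column sums divided by the square root of the grand sum.
grand_sum = sum(sum(row) for row in cov)
loadings = [sum(cov[i][j] for i in range(n_items)) / sqrt(grand_sum) for j in range(n_items)]

# Item-total (point-biserial) correlations computed directly from the scores.
totals = [sum(row) for row in responses]
mean_total = sum(totals) / n_persons
sd_total = sqrt(sum((t - mean_total) ** 2 for t in totals) / n_persons)
item_sds = [sqrt(cov[j][j]) for j in range(n_items)]
item_total_r = [
    sum((row[j] - means[j]) * (t - mean_total) for row, t in zip(responses, totals))
    / (n_persons * item_sds[j] * sd_total)
    for j in range(n_items)
]

for j in range(n_items):
    print(f"item {j + 1}: r_it = {item_total_r[j]:.4f}, loading / SD = {loadings[j] / item_sds[j]:.4f}")
```

The two printed columns agree because the covariance of an item with the total score is the item's column sum in the covariance matrix, and the variance of the total score is the grand sum of that matrix.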

The reason for the negligible effects of such item-analysis proce- 
dures has been explained by Nihira (1963). Since the first centroid 
factor lies at the center of gravity of the item configuration, the po- 
sition of the axis depends upon the intercorrelations of all the items 
analyzed, whether or not they are saturated upon the dimension 
under investigation. This circumstance, of course, lessens our as- 
surance that the first centroid axis actually represents univocally 
the construct that the test was designed to measure. Nihira further 
points out that what is needed to solve the dilemma is to rotate the 
first centroid axis to a position where it will represent more clearly 
the construct under consideration. Since there is no criterion for 
such a rotation in the item-total factor matrix, however, additional 
variates, known to reflect the factor, would have to be included in 
the analysis to guide the rotations. 

A similar effect is obtained in the method of successive item anal- 
ysis proposed by Wherry, Campbell, and Perloff (1951). In this 
procedure, the total-test score (criterion) is successively redeter- 
mined to represent only the homogeneous items in the test. Aside 
from the computational difficulties involved in such successive anal- 
yses and the problem of deciding when items are no longer homo- 
geneous, the underlying shortcomings of item-total correlations, 
elaborated by Nihira, are not overcome. 

It appears that whether factor-analytic or item-total correlation
methods of analysis are employed, an independent criterion of the 
construct is necessary for the selection of items in the same way 
that an independent criterion is necessary for selecting items to im- 
prove predictive validity. The occasion for obtaining not only posi- 
tive examples of the criterion (the construct), but negative ones 
also (other constructs), arose from a large factor analysis reported 
by Guilford, Merrifield, Christensen, and Frick (1961). Since all 
the factors in this analysis were defined by two or more tests, each
test could be item analyzed in two ways: (a) items could be se- 
lected that correlate highly with the total scores of tests that co- 
hered upon a specified factor, and (b) items could be deleted that 
correlate highly with total scores on tests that cohered upon other 
factors, particularly factors showing some correlation with the test
under consideration. 

The further need for the readministration and reanalysis of 15 
of the 30 tests reported in the 1961 analysis (Hoepfner, Guilford, 
and Merrifield, 1964) offered an opportunity to “cross validate,” in 
a sense, the item-analytic procedure.


Procedure 


Fifteen of the 30 tests administered to 240 Naval Air Cadets and 
Aviation Officer Candidates were factor analyzed and rotated or- 
thogonally to the factor pattern obtained from the analysis of all 
30 tests. Only eight of the 13 original factors were represented in 
the subsample of tests. These 15 tests were then intensively item
analyzed on the naval sample according to the two criteria de- 
scribed above. Each item retained in a test met the criteria of (a) 
correlating highly with total-test scores loading on its hypothesized 
factor, and (b) correlating lowly with total-test scores loading on 
other selected factors. The other factors selected were those among 
the eight that appeared to be correlated with each test under con- 
sideration in the 1961 analysis. 
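
The two retention criteria can be sketched in code. This is an illustration only: the numeric cutoffs, the use of the absolute value for the other-factor correlations, the function names, and the data are all hypothetical, since the text gives no thresholds for "highly" and "lowly."

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def select_items(items, own_factor_total, other_factor_totals, min_own=0.30, max_other=0.30):
    """Keep items that correlate highly with the own-factor criterion and
    lowly with every other-factor criterion. Thresholds are hypothetical."""
    kept = []
    for name, scores in items.items():
        r_own = pearson_r(scores, own_factor_total)
        r_other = max(abs(pearson_r(scores, t)) for t in other_factor_totals)
        if r_own >= min_own and r_other <= max_other:
            kept.append(name)
    return kept


# Hypothetical data for eight examinees.
own_total = [12, 15, 9, 20, 7, 18, 11, 16]   # total on a test marking the hypothesized factor
other_total = [5, 14, 12, 9, 6, 8, 15, 11]   # total on a test marking another factor
items = {
    "item_1": [0, 1, 0, 1, 0, 1, 0, 1],  # tracks the own-factor criterion
    "item_2": [0, 1, 1, 0, 0, 0, 1, 1],  # tracks the other-factor criterion
    "item_3": [1, 0, 1, 0, 1, 0, 1, 0],  # runs against the own-factor criterion
}

print(select_items(items, own_total, [other_total]))
```

In practice the criterion totals would come from other tests defining each factor, so that the item being evaluated does not contribute to its own criterion.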

Upon completion of the item analysis of the tests, new total scores 
were computed, intercorrelated, and factor analyzed. The resulting 
factor pattern was once again rotated to the pattern reported by 
Guilford et al. (1961). The new factor pattern was, of course, spuri- 
ously biased in favor of meeting the hypothesized factor pattern, as 
the tests had been item analyzed, based upon the same sample of 
examinees and the same factor pattern. 

To test the effectiveness of the item-analysis procedure inde- 
pendently, the item-analyzed tests were administered to a new sam- 
ple in another large battery of symbolic tests (Hoepfner et al., 
1964). The new sample consisted of 225 high-school seniors. In 
spite of differences in age and sex composition, it was thought that 
the 1961 and 1964 samples would be approximately equivalent. 
Score data on the 15 tests showed this assumption to be reasonable. 
It should be made clear, however, that the two test administrations
differed not only in the items, length of tests, and examinees, but 
also in the sequence of test administration. The orders of adminis- 
tration were independent, being determined by the exigencies of 
each test battery. 

The scores obtained from the new sample of examinees were in- 
tercorrelated, factor analyzed, and rotated to the same eight-factor 
orthogonal pattern as the other two sets of scores. A comparison be- 
tween the first and second test administrations on the basis of test 
reliability and similarity of factor structure would constitute an 
unbiased evaluation of the item-analysis procedures employed. 

Each of the 15 tests is described briefly below. Following the test 
name is a trigram code indicating the factor composition hypothe- 
sized for that test. The letters stand for the kind of operation, kind 
of content, and kind of product, respectively, as defined by the di- 
mensions of Guilford’s model of intelligence (Guilford and Merri- 
field, 1960). 

1. Camouflaged Words (NST). Find within a meaningful sentence
a group of consecutive letters that spell the name of a sport or
game.

2. Circle Reasoning (CSS). Discover the principle by which one
circle is blackened in each of four rows of circles and dashes.

3. Disemvowelled Words (CSU). Recognize familiar words given
with dashes in place of vowels.

4. Letter Triangle (CSS). Discover the pattern of letters arranged
systematically within a triangle.

5. Number Classification (CSC). Select one of five alternative
numbers to fit into each of four classes of given numbers.

6. Number-Group Naming (CSC). State how the numbers in each
set of three are alike.

7. Numerical Operations (MSI). Perform simple numerical calcu-
lations of addition, subtraction, and multiplication.

8. Operations Sequence (NSS). Select the correct order of three
specified numerical operations in order to get from one given
number to another.

9. Seeing Trends II (CSR). Recognize in a sequence of familiar
words a trend based upon relations of letters.

10. Symbol Grouping (CSI). Rearrange scrambled symbols in a
specified systematic order as efficiently as possible.

11. Word Changes (NSS). Arrange a list of words so that the first
word is changed into the last word with one letter change at
each step.

12. Word Combinations (CSU). Produce a new word out of the
ending of one word and the beginning of another.

13. Word Patterns (CSI). Arrange a small list of short words effi-
ciently in a crossword-puzzle design.

14. Word Relations (CSR). Recognize the same spelling relation be-
tween words in each of two pairs, then complete a third pair
from given alternative words, using the same relation.

15. Word Transformations (NST). Regroup the letters of words in a
phrase in order to make a different set of words.


Results 


Table 1 lists several of the descriptive statistics determined from 
the three analyses of the 15 tests. In all the tables, the first admin- 
istration to naval personnel is labeled “1961”; the reanalysis of 
those data is labeled “1962”; and the second administration to high- 
school seniors is labeled “1964.” The first column of Table 1 lists the 
number of items for each test at each analysis. The 1962 and 1964 
analyses have equal numbers of items, because the 1964 tests were 
based upon the items selected for retention in the 1962 reanalysis. 

The second column of Table 1 lists the time allowed for each test 
in the 1961 and 1964 administrations. Since the 1962 tests were 
never actually separately administered, but only analyzed forms 
of the 1961 tests, no times are listed for that analysis. It can be 
seen that there is a 25 per cent saving of time in the 1964 admin- 
istration of the tests, an important asset when large batteries of 
tests are to be administered to large groups of examinees. 

Since the reduction of time and items in 1964 might reasonably be 
expected to reduce the test reliabilities, estimates of the reliabilities 
for the 1961 and 1964 analyses can be found in the third column of 
Table 1. Although many of the reliability coefficients cannot be 
compared exactly over the two administrations, due to the altera- 
tions in numbers of parts of the tests and the subsequent need for 
computing different reliability coefficients, it appears that appro- 
priate construct item analysis does not reduce test reliabilities. 

The fourth column of Table 1 lists the mean scores on each test 
for each analysis. The means are expressed in terms of the proportion 
of total possible points in order to account for differences in 
means due to time and test-length differences. It can be seen that 
test means for all analyses and both samples are generally very 
similar. 

The last column of Table 1 lists those factors employed as “other” 
constructs in the item analyses; i.e., those criterion factors em- 
ployed in determining the deletion of items from tests for each 
hypothesized factor because those items showed correlations with 
those other factors. 

Roughly comparable test statistics over all three analyses made 
the three score matrices appropriate for factorial comparisons. The 
goal of such comparisons was to factor analyze and rotate all three 
matrices to the meaningful structure reported in the 1961 analysis. 
Whether the item-analyzed matrices closely conformed to the factor 
structure of the original (1961) matrix, would be a test of the effec- 
tiveness of the item-analysis procedure for purifying tests of uni- 
factor constructs. 

Table 2 presents the intercorrelations among the 15 test variates 
for the three analyses. In each cell, the top correlation coefficient is 
computed from the 1961 data, the second coefficient from the 1962 
data, and the third coefficient from the 1964 data. In all three analy- 
ses, correlations between tests for the same hypothesized factors 
range from the .30’s to the .60’s, with no appreciable increases over 
those reported in 1961. The correlations between tests for different 
hypothesized factors show little difference in the desired direction 
(toward lower correlations) between the 1961 and 1964 analyses. 
The between-factor correlations for the 1962 tests are noticeably 
smaller, due to the fact that the item analysis was performed on the 
very same data and capitalized upon all the conditions influencing 
that set of intercorrelations. 

Table 3 is composed of the three rotated factor matrices ob- 
tained from each set of data. The correlation matrices were sub- 
mitted to an iterative communality-estimation program, and sta- 
bilized communality estimates were placed into the principal diag- 
onals for principal-axes extraction. Eight factors were extracted 
from each correlation matrix and were rotated toward fixed target 
matrices of loadings (Cliff, 1964). The target matrices were con- 
structed by placing the square root of the communality of each test 
as its loading upon its hypothesized factor, with loadings of .00 on 
all other factors. Although such perfect simple-structure solutions 
could not be expected to be obtained, approximately equal rota- 
tional strains would be placed upon each matrix by the constructed 
target matrices. The approximate equality of rotational restric- 
tions, it was felt, made factor comparisons more meaningful than 
placing differential restrictions on each matrix in order to achieve 
some other rotational criterion. 
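
The construction of the target matrices described above can be sketched in a few lines. This is an illustration only: the function name `target_matrix`, the column order of the factors, and the three example tests with their communalities are assumptions made for the sketch, not the program actually used in the study.

```python
import math

# Factor labels in an assumed column order for the rotated solutions.
FACTORS = ["MSI", "CSU", "CSC", "CSR", "CSS", "CSI", "NSS", "NST"]


def target_matrix(hypothesized, communalities, factors=FACTORS):
    """Build a target loading matrix: the square root of each test's
    communality on its hypothesized factor and .00 on all other factors."""
    rows = []
    for factor, h2 in zip(hypothesized, communalities):
        rows.append([math.sqrt(h2) if name == factor else 0.0 for name in factors])
    return rows


# Hypothetical example: three tests with illustrative communalities.
example = target_matrix(["NST", "CSS", "CSU"], [0.45, 0.35, 0.66])
for row in example:
    print(" ".join(f"{value:4.2f}" for value in row))
```

Because each row has a single nonzero entry equal to the square root of the communality, the sum of squared target loadings in a row reproduces that test's communality, which is what places comparable rotational strain on each matrix.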


TABLE 3 
Rotated Factor Matrices for Three Analyses 


1961 Analysis 
Variable   MSI  CSU  CSC  CSR  CSS  CSI  NSS  NST   h² 
 1.         04   26   07   20   03  -04   17   55   45 
 2.        -02   19   05   24   34  -04   36  -04   35 
 3.         10   67   12   31   00   12   07   26   66 
 4.        -04   04   05   28   46   22   33   13   47 
 5.        -12   17   74   09  -01   10   37  -12   16 
 6.         23   03   74   02   09   06   00   21   66 
 7.         55   08   12   10  -06   08   32   01   45 
 8.         35   13   16   18   24   21   65   28    ? 
 9.        -05   11   03   60   05   18   25   22   51 
10.         09   04   10   17   16   51   27   08   41 
11.        -11   14   21   35   27   24   60   19   72 
12.        -04   56   10   05   17   22   22   37   59 
13.         00   37   11   19   04   33   33   02   40 
14.         13   27   09   45   38   16   35   16   62 
15.        -08   44   03   20   07   08   38   44   55 


1962 Analysis 
Variable   MSI  CSU  CSC  CSR  CSS  CSI  NSS  NST   h² 
 1.         01   24   02   19   07   02   07   47   33 
 2.        -14   09   01   20   36   06   31   00   30 
 3.         06   69   12   33   00   10   07   17   65 
 4.         06   13   06   25   54   08   20   10   43 
 5.        -16   12   72  -08   05   03   26   00   63 
 6.         28   11   70   07   01   08  -06   06   60 
 7.         47   09   14   12  -06   09   38  -01   42 
 8.         28   13   14   21   26   26   66   21   77 
 9.        -07   07   00   66   05   13   24   19   56 
10.         11   08   07   16   16   62   21   00   50 
11.         00   16   06   32   10   17   57   09   50 
12.         02   66   10   00   18   14   21   27   62 
13.        -03   21   05   13  -05   44   33   09   38 
14.         17   29   04   43   33   13   31   16   55 
15.          ?    ?    ?    ?    ?    ?    ?    ?    ? 


1964 Analysis 
Variable   MSI  CSU  CSC  CSR  CSS  CSI  NSS  NST   h² 
 1.          ?   16   05   15  -15   10   18   42   32 
 2.        -20   21   16   11   55   21   08   10   48 
 3.         02   49  -02  -05   08   07   30   37   48 
 4.         05  -01   15   28   38   12   49   08   51 
 5.         30   00   47   17   18   31   23   15   54 
 6.        -17   16   18   21   10   15   43   12   86 
 7.         48   08    ?   10  -15   09   33   11   44 
 8.         17   10   28   10   13   50   52   10   67 
 9.        -04   01   17   72   15   12   17   23   67 
10.         15   12   34   24   26   31   23  -05   43 
11.         08   25   36   36   32  -14   59   17   88 
12.         05   49   23   30   12   16   14   18   48 
13.        -07   14   22   16   11   37   21   29   38 
14.         13   22   24   50   20   21   35   05   58 
15.         20   33   22   16   25   10   15   66   75 


Note. Decimal points omitted. A question mark marks an entry that is illegible in the source. 


The factor patterns in the rotated solutions of Table 3 appear to 
be similar. The 1961 data yield a pattern with 13 “misses,” loadings 
of tests over .30 where .00 loadings were hypothesized and targeted. 
The clearest factor in this solution is CSC, with no misses, and the 
least clear is NSS with seven misses. The item-analyzed (1962) 
data yield a much clearer factor pattern, having only eight misses. 
The factors are very similar, but are clearer in the 1962 analysis in 
every case. The factor pattern obtained from the 1964 data is also 
similar. Since the data are based upon the administration of altered 
test forms to a different sample of examinees, it could not be ex- 
pected to be as similar to the first two patterns as they are to each 
other. For all eight factors, however, the pattern of loadings is very 
much like that of the 1961 and 1962 data. In most cases the factors 
are less clear, but nonetheless defined. It should be noted here that 
the unclear factor of the 1961 and 1962 analyses is the same as the 
unclear one in the 1964 analysis. A possible explanation is that 
tests for NSS may be too complex to measure one factor only. 
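
The tally of "misses" used above can be made explicit with a short sketch. It assumes, following the wording of the text, that a miss is a loading greater than .30 on a factor for which the test was targeted at .00; the function name, the three-factor layout, and the loadings are hypothetical and are not taken from Table 3.

```python
def count_misses(loadings, hypothesized, factors, cutoff=0.30):
    """Count, per factor, the loadings above `cutoff` on tests that were
    targeted to load .00 on that factor."""
    misses = {name: 0 for name in factors}
    for row, target in zip(loadings, hypothesized):
        for name, value in zip(factors, row):
            if name != target and value > cutoff:
                misses[name] += 1
    return misses


# Hypothetical 4-test, 3-factor rotated solution (not taken from Table 3).
factors = ["A", "B", "C"]
hypothesized = ["A", "A", "B", "C"]
loadings = [
    [0.62, 0.10, 0.35],
    [0.55, 0.05, 0.12],
    [0.31, 0.58, 0.33],
    [0.08, 0.02, 0.49],
]
misses = count_misses(loadings, hypothesized, factors)
total = sum(misses.values())
```

A factor with no misses corresponds to the "clearest" factor in the sense used in the text, and the factor with the largest count to the least clear one.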

As a further evaluation of the tenability of the conclusion that 
each structure-of-intellect factor has a similar set of loadings in all 
three solutions, coefficients of congruence (Tucker, 1951) were computed 
between the factors. Table 4 presents the Tucker φ coefficients 
between like-named factors. 

The coefficients of congruence between the same factors in the 
1961 and 1962 analyses are all very close to 1.0, a spurious result to 
TABLE 4 
Congruence Coefficients (φ) between Similar Factors in Three Analyses 


Analyses         MSI  CSU  CSC  CSR  CSS  CSI  NSS  NST 
1961 and 1962    .94  .97  .99  .99  .96  .95  .99  .95 
1961 and 1964    .49  .94  .88  .89  .87  .71  .88  .83 
1962 and 1964    .49  .99  .83  .86  .82  .76  .84  .90 

be expected from reanalysis of the same data. Between the 1961 and 
1964 analyses (the crucial comparison) most factors appear highly 
congruent. The exception to this conclusion, factor MSI, is prob- 
ably due to the fact that this factor was determined as a singlet, 
and as such exhibited little stability between analyses. Compari- 
sons between the 1962 and 1964 analyses are similar to those be- 
tween 1961 and 1964. It is concluded that, except for the singlet 
factor, the intensive item analysis has resulted in a battery of tests 
capable of reliably reflecting the same dimensions of individual 
differences with a great saving of testing time. 


Summary 


A battery of 15 symbolic tests was item analyzed so that only 
those items correlating high with tests of the same factor and low 
with tests measuring other factors were retained. The item-analyzed 
tests were then readministered to a different sample. Comparison of 
the eight factor dimensions of original and analyzed data reveals 
that the shorter, item-analyzed test battery reliably reflects the 
same dimensions of individual differences in intellectual functioning. 


COMPUTATIONAL REMARKS ON A MEASURE 
FOR COMPARING FACTORS 


ROBERT G. RYDER 


Child Research Branch 
National Institute of Mental Health 
Bethesda, Maryland 


Pinneau and Newhouse (1964) suggest comparing factors from 
different samples by computing factor scores based on each set of 
loadings, but using only one sample of test scores, and intercorre- 
lating these various factor scores. For example, a high correlation 
between scores for factor i from sample I and scores for factor j 
from sample II would indicate similarity between factors i and j. 
The same set of tests must be used with each sample for this method 
to be applicable. 

Certain simplifications are available which allow one to compute 
correlations between factor scores without using the original sub- 
jects by tests score matrix. One needs only the intercorrelations 
among variables and the relevant factor loadings. If unrotated 
principal axes are used, computation becomes quite simple indeed 
and depends only on the factor loadings of the factors to be com- 
pared. In the latter case it becomes convenient to base interfactor 
correlations on the total sample, combining I and II. 


Terms and Assumptions 

The tests by factors unrotated factor matrix for sample I is $F_I$, 
and for sample II is $F_{II}$. Assuming factoring is complete, 
$F_I F_I' = R_I$, the matrix of intertest correlations for sample I, 
and similarly $F_{II} F_{II}' = R_{II}$. If communalities are used, 
$R_I$ and $R_{II}$ have communalities rather than unities as main 
diagonal elements. Where factors are principal axes, 
$F_I' F_I = D_I$ and $F_{II}' F_{II} = D_{II}$, where $D_I$ and 
$D_{II}$ are diagonal matrices of sums of squared factor loadings. 
Only orthogonal rotation is considered, where the rotated factor 
matrix from sample I is $K_I = F_I A_I$, $A_I$ being a square 
orthonormal matrix, and similarly $K_{II} = F_{II} A_{II}$. 

The principal simplifying assumption is that means and standard 
deviations are the same for both sample I and II, and that no seri- 
ous distortion is made by considering both samples to be the same 
size. That is, we estimate the matrix of intertest correlations for the 
total combined sample to be $(R_I + R_{II})/2$. 


Orthogonally Rotated Factors 


A matrix of between-analysis interfactor correlations can be de- 
rived for each sample of subjects. We assume for present purposes 
that multivariate regression coefficients, by which test scores are 
multiplied to yield factor scores, remain fixed from sample to sample. 
(If new weights were derived by a separate multivariate regression 
procedure for each sample, the same two sets of interfactor 
correlations would result; but the set for sample I would now be the 
set for sample II, and vice versa.) The between-analysis interfactor 
correlations based on sample I, say $R_{kI}$, are 

$$
R_{kI} = (K_I' R_I^{-1})\, R_I\, (R_{II}^{-1} K_{II}) = K_I' R_{II}^{-1} K_{II} \tag{1}
$$

and of course for sample II, 

$$
R_{kII} = K_I' R_I^{-1} K_{II}. \tag{2}
$$


Unrotated Principal Axes 


Here things become much easier computationally, because of the 
simpler nature of factor scores (Ryder, 1965). The between-analysis 


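interfactor correlations based on sample I are 

$$
\begin{aligned}
R_{fI} &= (F_I' R_I^{-1})\, R_I\, (R_{II}^{-1} F_{II}) \\
       &= (D_I^{-1} F_I' F_I F_I' R_I^{-1})\, R_I\, (R_{II}^{-1} F_{II} F_{II}' F_{II} D_{II}^{-1}) \\
       &= (D_I^{-1} F_I')\, F_I F_I'\, (F_{II} D_{II}^{-1}) \\
       &= F_I' F_{II} D_{II}^{-1}
\end{aligned}
\tag{3}
$$

and for sample II they are 

$$
R_{fII} = D_I^{-1} F_I' F_{II}. \tag{4}
$$
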
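
The reduction in equations (3) and (4) rests on an identity that the derivation uses without comment. The short derivation below spells it out; it assumes, as the text does, that factoring is complete, so that $F F' = R$ is nonsingular and $F'F = D$ for principal axes. The unsubscripted symbols stand for either sample.

```latex
\begin{aligned}
R F &= (F F') F = F (F' F) = F D \\
\Longrightarrow\quad R^{-1} F &= F D^{-1},
\qquad F' R^{-1} = D^{-1} F' .
\end{aligned}
```

In words, the regression weights $R^{-1}F$ for a principal axis are simply that factor's loadings divided by their own sum of squares, which is why only the loadings of the factors being compared are needed.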
The individual scalar correlation, say $r_{ij}$, between scores for 
factor i from sample I and scores for factor j from sample II, based 
on the total combined sample of subjects, is 

$$
r_{ij} = \frac{1}{2}\left(\frac{1}{\sum_t f_{ti}^2} + \frac{1}{\sum_t f_{tj}^2}\right)\sum_t f_{ti} f_{tj}
       = \frac{\sum_t f_{ti}^2 + \sum_t f_{tj}^2}{2 \sum_t f_{ti}^2 \sum_t f_{tj}^2}\, \sum_t f_{ti} f_{tj} \tag{5}
$$

where $f_{ti}$ and $f_{tj}$ are factor loadings of test t on, respectively, 
factors i and j. In other words, $r_{ij}$ involves little more than obtaining 
the sums of squares and sums of cross products of the relevant factor 
loadings. 

Note that although complete factoring is assumed in deriving $R_{fI}$ 
and $R_{fII}$, the resulting coefficients do not depend on obtaining any 
factors except those to be compared. 


Equation 5 can be further simplified if $\sum_t f_{ti}^2 = \sum_t f_{tj}^2$. Then, 

$$
r_{ij} = \frac{\sum_t f_{ti} f_{tj}}{\sqrt{\sum_t f_{ti}^2 \sum_t f_{tj}^2}} \tag{6}
$$

which is the Tucker coefficient of congruence (Tucker, 1951). Under 
these special conditions, then, the Tucker coefficient is the correlation 
between factor scores based on the total combined sample. 
Equation 6 raises the possibility that the Tucker coefficient may 
often be a rough indicator of the correlation between factor scores; 
but only for factors with at least roughly similar sums of squared 
loadings, and only when unrotated axes are used. 
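
A small numerical sketch may make the relation between Equations 5 and 6 concrete. The function names and the loadings below are hypothetical; the code merely evaluates the two formulas for one pair of factors, assuming unrotated principal axes as the text requires.

```python
import math


def combined_sample_r(f_i, f_j):
    """Equation 5: interfactor coefficient for the combined sample, from the
    loadings of factor i (sample I) and factor j (sample II)."""
    ss_i = sum(x * x for x in f_i)
    ss_j = sum(x * x for x in f_j)
    cross = sum(a * b for a, b in zip(f_i, f_j))
    return 0.5 * (1.0 / ss_i + 1.0 / ss_j) * cross


def tucker_congruence(f_i, f_j):
    """Equation 6: Tucker's coefficient of congruence."""
    ss_i = sum(x * x for x in f_i)
    ss_j = sum(x * x for x in f_j)
    cross = sum(a * b for a, b in zip(f_i, f_j))
    return cross / math.sqrt(ss_i * ss_j)


# Hypothetical loadings of three tests on the two factors being compared.
f_i = [0.6, 0.5, 0.4]
f_j_equal = [0.4, 0.5, 0.6]      # same sum of squares as f_i
f_j_small = [0.2, 0.25, 0.3]     # same pattern, smaller sum of squares

print(combined_sample_r(f_i, f_j_equal), tucker_congruence(f_i, f_j_equal))
print(combined_sample_r(f_i, f_j_small), tucker_congruence(f_i, f_j_small))
```

The two formulas differ by the factor $(a + b)/(2\sqrt{ab})$, where a and b are the two sums of squared loadings. This factor is never less than one and equals one only when a = b, so Equation 5 is at least as large in absolute value as the Tucker coefficient; this is the sense in which the latter is only a rough indicator when the sums of squares differ.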

Since computations with unrotated principal axes are simple and 
require little information, it might be pointed out that they can also 
be useful, even if one is primarily interested in rotation. For exam- 
ple, the number of unrotated axes that are stable from sample to 
sample might be used as a criterion for deciding how many axes to 
rotate. Also, where factors are principal axes, if $A_I$ and $A_{II}$ are 
known, the correlations among unrotated factors can be used to 
simplify the computation of correlations among rotated factors, as 

$$
R_{kI} = A_I' R_{fI} A_{II} \tag{7}
$$

and 

$$
R_{kII} = A_I' R_{fII} A_{II}. \tag{8}
$$
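
As a sketch of how equations (3) and (7) chain together, the following code first forms the unrotated interfactor matrix from two sets of principal-axes loadings and then carries it into a rotated frame. The helper names, the loadings, and the rotation angles are all hypothetical; the only requirement observed is that the columns of each loading matrix are mutually orthogonal, as they are for principal axes.

```python
import math


def transpose(m):
    return [list(col) for col in zip(*m)]


def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def unrotated_interfactor(F1, F2):
    """Equation (3): F_I' F_II D_II^{-1}, where D_II holds the column
    sums of squares of F_II."""
    cross = matmul(transpose(F1), F2)
    ss2 = [sum(x * x for x in col) for col in zip(*F2)]
    return [[cross[i][j] / ss2[j] for j in range(len(ss2))] for i in range(len(cross))]


def rotated_interfactor(A1, R_f, A2):
    """Equation (7): A_I' R_fI A_II for orthonormal rotation matrices."""
    return matmul(matmul(transpose(A1), R_f), A2)


def rotation(theta):
    """A 2 x 2 orthonormal rotation matrix."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s], [s, c]]


# Hypothetical three-test, two-factor loadings with orthogonal columns.
F1 = [[0.7, 0.3], [0.5, -0.3], [0.6, -0.1]]
F2 = [[0.7, 0.4], [0.7, 0.0], [0.7, -0.4]]

R_f1 = unrotated_interfactor(F1, F2)
R_k1 = rotated_interfactor(rotation(math.pi / 6), R_f1, rotation(math.pi / 4))
```

Only the loadings and the two rotation matrices enter the computation; neither the score matrix nor the inverse of a correlation matrix is needed.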


THE USE OF FACTOR MODELS IN CURRICULUM 
EVALUATION: A MATHEMATICAL MODEL RELATING 
TWO FACTOR STRUCTURES 


PETER A. TAYLOR 
Rutgers—The State University 


An earlier paper presenting a model of curriculum evaluation, 
written with Maguire (1966), proposed the sectioning of the evalu- 
ation process into a number of steps. The steps involved the iden- 
tification of broad, societal-institutional objectives; the rephrasing 
of these latter into operational terms more suitable for measure- 
ment operations; the translation of operationally-stated objectives 
into teaching practices; and the student outcomes after being sub- 
jected to these practices. In the Taylor-Maguire model, it was 
stressed that different (though not necessarily disjoint) sets of in- 
dividuals were involved at each step, and in passing judgment upon 
the values involved in the transition from step to step. 

It is quite obvious to all evaluators that these different sets of 
individuals frequently view educational needs and goals from “bi- 
ased” vantage points. For example, while set theory may seem to 
the mathematician to be indispensable to the development of mod- 
ern algebra, to the “lay” parent it may seem a waste of their child’s 
time. Even experts may differ in what they consider to be important, 
or central, to a discipline. One approach to biology, for instance, 
may be ecological, another taxonomic. 

Without describing a multitude of procedures in detail here, we 
shall simply assert that it is possible to obtain “maps” of the per- 
ceptions and preferences of individuals as to what they consider to 
be important in the development of a curriculum. Any standard 
text on psychological scaling will outline many appropriate experimental 
methods. Of particular interest in the present instance, however, 
are those techniques, such as pair comparisons followed by 
multidimensional scaling and factoring, which yield an n-space. 
Such a space will have as many (n) dimensions as there are ways 
in which an individual (or group of individuals) views a curriculum. 
Thus, a chemist may look at a chemistry curriculum in terms 
of its factual content, its methodological content, its contribution 
to an understanding of applied chemistry, and so on. A housewife 
may look at the same curriculum, and view it in terms of its factual 
content, the contribution the content makes to solving everyday 
problems, the moral aspects of the use of knowledge of atomic 
structure, and so forth. Some of these perceptual dimensions are the 
same, others (clearly) are different. 

The methods of factor analysis may be used to identify and to 
help label these dimensions on which persons view a curriculum. To 
the educational evaluator, it is important not only to identify these 
dimensions per se, but to identify congruences and discrepancies in 
viewpoint so that, having brought the differences into the open, they 
may be discussed and, hopefully, some agreement reached as to 
the relative merits of the different stands. 

The remainder of this paper is devoted to a theoretical exposition 
of a method for matching the factor structures of the viewpoints of 
two sets of individuals. We shall assume that we have available 
both factorial structures as the result of carrying out some experimental 
procedure and the concomitant factoring. 


Historical Overview 


The investigation of a mathematical relationship between two 
factorial structures belongs to a branch of canonical analysis. The 
applicability of canonical analysis to the field of educational re- 
search was first proposed by Hotelling in his The Most Predictable 
Criterion (1935). Since then, others have supported the applicabil- 
ity of this approach; e.g., Thomson (1947), Burt (1948), Bartlett 
(1948), Horst (1961). 

In 1948, M. S. Bartlett proposed a concept of internal and ex- 
ternal factor analysis. To take his example, suppose that a set of 
mental tests and a set of physical measurements are made on the 
same group of persons. The internal analysis of the mental test 
scores will yield the ordinary factor analysis, and also the internal 
analysis of the physical measurements will yield a factor analysis 
of physical structure. The mutual relations, if any, between the 
two sets of measures would be examined by means of external fac- 
tor analysis. Bartlett’s method of external factor analysis is a tech- 
nique for matching the underlying factors for two sets of variates, 
in this instance, the mental test scores and the physical measures. 
The factor matrices for both sets of variates are simultaneously re- 
lated by orthogonal transformations until a factor from one set is 
maximally correlated with a factor from the other set, thus identi- 
fying the first factor pair. These factors are held fixed while a sec- 
ond pair is identified, and so on. 

As for the stability of factors over different batteries of tests, 
Ledyard R Tucker developed an Inter-battery method of factor 
analysis (1958). Given two test batteries, not composed of parallel 
tests but postulated to depend on the same common factors, factors 
that are common to the two batteries are determined from the 
correlations of the tests in one battery with tests in the other bat- 
tery. Gibson (1960a,b, 1961) further expanded Tucker’s method. 

Both Bartlett’s and Tucker’s approaches are related to the present 
problem, as far as mathematical technique is concerned. But they 
are different from the present problem in that the former's model 
treats the case where the same persons take different tests, while 
the latter’s model treats the case in which different subjects respond 
to the same tests. This type of problem has been treated in the context 
of factorial invariance [see, for example, Tucker (1951), Leyden 
(1953), Barlow and Burt (1954), Zachert and Friedman 
(1953), Wrigley and Neuhaus (1955), Meredith (1964a,b) and 
Pinneau and Newhouse (1964)]. 


Present Problem Defined 


The present problem is the determination of axes so as to take ac- 
count of different sets of “tests” (in an evaluation context, these we 
shall refer to as “objectives”) over different groups of subjects, si- 
multaneously. 

Suppose, for example, that we are developing a curriculum, and 
the textbook writers (hereafter identified by a W), who will them- 
selves have a certain viewpoint, are consulting a group of experts 
(E’s) who will also have some view of the content. Suppose further, 
we have somehow asked both of these groups to declare their “valuation” 
of the curriculum objectives, and that we have done so in a 
fashion that has enabled us to obtain a factorial representation of 
these values. Finally, suppose that $M_W$ common factors are taken 
from writer-rating scores on n objectives, and $M_E$ common factors 
are taken from expert-rating scores for the same n objectives. 
(These objectives may, or may not, be identical for the two groups. 
Technical language may differ; some objectives may have no meaning 
for one group, and so on, but the import should be the same for 
both.) The $M_W$, $M_E$ are not necessarily equal. Now, instead of taking 
the sum of squares of the first factor loadings over n objectives 
to be a maximum for each group of data as the principal axis 
method does, one may fix the first reference axis so that the first- 
factor loadings of n objectives for the W data are “as similar as 
possible" to the first-factor loadings for the E data. (The question 
of a criterion for similarity will be returned to later.) The second 
reference axis, also, could be chosen so that the second-factor load- 
ings of the n objectives for the W data are again as similar as pos- 
sible to the second-factor loadings of the E data, with the added 
condition that the second-factor loadings are orthogonal to the first. 
The same process is followed until the pairs of factor loadings are 
no longer similar to each other. The resulting system of matched 
factors obtained in this fashion we shall refer to as “congruent” 
factors. Thus, the first factor derived from the E scores is maxi- 
mally congruent with the first factor obtained from the W scores. 
The second factor derived from the E scores is the next most 
congruent with the second factor from the W scores, and so on. 
Following Tucker’s (1951) precedent, the degree of congruence 
will be defined as the sum of the cross-products of the matched- 
factor loadings where the loadings have been normalized over the 
n objectives and the sum of the squared coefficients is unity (see 
Harman, 1960, p. 255). The number so obtained is referred to as 
the “coefficient of congruence.” It is at once obvious that the co- 
efficient of congruence so defined differs from a correlation coefficient 
since the latter involves the sum of the cross-products of the devia- 
tion from the means with unit over-all variance, while the coefficient 
of congruence refers to the sum of cross-products of the unit varia- 
bles. Like the correlation coefficient, however, the coefficient of 
congruence lies between —1 and +1 (inclusive). An obtained value 
of +1 would mean that the two sets of matched factor loadings (rep- 
resenting the two viewpoints) were identical; a value of —1 would 
mean that identity existed, but in the “opposite direction”; and a 
zero value would mean that there was no relationship between the 
loadings. 
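
The distinction drawn above between the coefficient of congruence and the correlation coefficient can be illustrated numerically. The sketch below is hypothetical throughout: the function names and the two sets of three loadings are invented for the illustration, and the correlation is computed simply as the congruence of the deviations from the means.

```python
import math


def congruence(x, y):
    """Sum of cross-products of loadings normalized to unit sum of squares."""
    cross = sum(a * b for a, b in zip(x, y))
    return cross / math.sqrt(sum(a * a for a in x) * sum(b * b for b in y))


def pearson(x, y):
    """Product-moment correlation: the same formula applied to deviations
    from the means (assumes neither set of loadings is constant)."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    return congruence([a - mean_x for a in x], [b - mean_y for b in y])


# Hypothetical matched-factor loadings on three objectives.
writers = [0.8, 0.7, 0.6]
experts = [0.6, 0.7, 0.8]

print(round(congruence(writers, experts), 3))
print(round(pearson(writers, experts), 3))
```

For these loadings the congruence is high, since both factors load substantially and positively on every objective, whereas the correlation is strongly negative, since it reflects only the opposite ordering of the deviations about the means. The two indices therefore answer different questions about a pair of factors.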

The establishment of a minimum coefficient of congruence, values 
below which would result in the statement that “the factors 
do not match,” is pretty much a subjective matter (but see Tucker, 
1951, p. 18; Harman, 1960, p. 259). Suppose that the data lead to 
p sets of pairs of congruent factors, where $p \le (M_k)^*$, k = E, W, 
and $(M_k)^*$ is the smaller of $M_E$, $M_W$. In the case where writers and 
experts are being compared, the $(M_W - p)$ unpaired factors would 
be interpreted as being the factors involved in the writers’ ratings 
only and non-congruent with experts’ ratings. A similar interpretation 
would be given the $(M_E - p)$ unpaired factors. 
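
The bookkeeping implied by this paragraph can be sketched as follows. The minimum coefficient of .90 is purely an assumed value for illustration, since the text stresses that the choice is subjective; the function names and the list of coefficients are likewise hypothetical.

```python
def count_congruent_pairs(coefficients, minimum):
    """Number of successive matched pairs, taken in order of extraction,
    whose congruence is at least `minimum`; matching stops at the first
    pair that falls below it."""
    p = 0
    for value in coefficients:
        if value < minimum:
            break
        p += 1
    return p


def unpaired_factors(m_w, m_e, p):
    """Return (M_W - p, M_E - p), the factors peculiar to writers and to experts."""
    if p > min(m_w, m_e):
        raise ValueError("p cannot exceed the smaller of M_W and M_E")
    return m_w - p, m_e - p


# Hypothetical congruence coefficients for successive matched pairs.
coefficients = [0.98, 0.93, 0.71]
p = count_congruent_pairs(coefficients, minimum=0.90)
writers_only, experts_only = unpaired_factors(m_w=5, m_e=3, p=p)
```

With these invented figures two pairs are accepted, leaving three factors peculiar to the writers and one peculiar to the experts.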

In summary, then, the problem posed requires the identification 
of congruent factors (if any) involved in the rating of a given set 
of objectives by each of two groups, in this case, curriculum writers 
and experts. (Teachers, students and the lay public could be equally 
well substituted for one or both of these groups.) 


Mathematical Presentation 


Suppose data are available from writers’ and experts’ ratings of 
curriculum objectives. Let 


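$$
Z_W = \begin{bmatrix}
z_{11_W} & z_{12_W} & \cdots & z_{1i_W} & \cdots & z_{1N_W} \\
z_{21_W} & z_{22_W} & \cdots & z_{2i_W} & \cdots & z_{2N_W} \\
\vdots   & \vdots   &        & \vdots   &        & \vdots   \\
z_{j1_W} & z_{j2_W} & \cdots & z_{ji_W} & \cdots & z_{jN_W} \\
\vdots   & \vdots   &        & \vdots   &        & \vdots   \\
z_{n1_W} & z_{n2_W} & \cdots & z_{ni_W} & \cdots & z_{nN_W}
\end{bmatrix}
\quad\text{and}\quad
Z_E = \begin{bmatrix}
z_{11_E} & z_{12_E} & \cdots & z_{1i_E} & \cdots & z_{1N_E} \\
z_{21_E} & z_{22_E} & \cdots & z_{2i_E} & \cdots & z_{2N_E} \\
\vdots   & \vdots   &        & \vdots   &        & \vdots   \\
z_{j1_E} & z_{j2_E} & \cdots & z_{ji_E} & \cdots & z_{jN_E} \\
\vdots   & \vdots   &        & \vdots   &        & \vdots   \\
z_{n1_E} & z_{n2_E} & \cdots & z_{ni_E} & \cdots & z_{nN_E}
\end{bmatrix}
\tag{1}
$$

where 

j = 1, 2, ..., n;  n = number of objectives being rated 
$i_W = 1_W, 2_W, \ldots, N_W$;  $N_W$ = number of writers ranking the objectives 
$i_E = 1_E, 2_E, \ldots, N_E$;  $N_E$ = number of experts ranking the objectives 

The $n \times N_W$ matrix $Z_W$ has a $(j, i_W)$th element which is a standard 
rating of the jth objective by the $i_W$th writer. 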
The $n \times N_E$ matrix $Z_E$ has a $(j, i_E)$th element which is a standard 
rating of the jth objective by the $i_E$th expert. 

Clearly, for matrices $Z_W$ and $Z_E$, the entries bearing the same 
row subscript, j, represent equivalent objectives, the difference 
(represented by the $i_k$ subscript) being that in one case writers have 
done the ranking, the other, experts. 

For convenience, the scores in each row of both matrices are 
assumed to be standardized, i.e., 


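$$
\frac{1}{N_W} \sum_{i_W = 1}^{N_W} z_{ji_W} = 0 \tag{2}
$$

$$
\frac{1}{N_E} \sum_{i_E = 1}^{N_E} z_{ji_E} = 0 \tag{3}
$$

$$
\frac{1}{N_W} \sum_{i_W = 1}^{N_W} z_{ji_W}^2 = 1 \tag{4}
$$

and 

$$
\frac{1}{N_E} \sum_{i_E = 1}^{N_E} z_{ji_E}^2 = 1 \tag{5}
$$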


Hence, the product of $Z_W$ and its transpose, $Z_W'$, divided by $N_W$ 
yields a matrix of correlation coefficients, $R_W$, between objectives 
judged by writers, i.e., 

$$
R_W = \frac{1}{N_W} Z_W Z_W' \tag{6}
$$

similarly, 

$$
R_E = \frac{1}{N_E} Z_E Z_E' \tag{7}
$$

where $R_E$ is a matrix of correlation coefficients between objectives 
judged by experts. 
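
The standardization of equations (2) through (5) and the product of equations (6) and (7) can be sketched as below. The function names and the ratings are hypothetical; rows are objectives and columns are raters, and the divisor N (rather than N - 1) is used so that the sum of squares condition holds exactly as stated.

```python
import math


def standardize_rows(ratings):
    """Give each row (objective) mean zero and mean square one across raters.
    Assumes no row is constant, since its standard deviation would be zero."""
    standardized = []
    for row in ratings:
        n_raters = len(row)
        mean = sum(row) / n_raters
        sd = math.sqrt(sum((x - mean) ** 2 for x in row) / n_raters)
        standardized.append([(x - mean) / sd for x in row])
    return standardized


def correlation_matrix(Z):
    """R = Z Z' / N for row-standardized Z with N columns (raters)."""
    n_raters = len(Z[0])
    return [[sum(a * b for a, b in zip(r1, r2)) / n_raters for r2 in Z] for r1 in Z]


# Hypothetical ratings of three objectives by four writers.
writer_ratings = [
    [5, 3, 4, 2],
    [4, 4, 5, 1],
    [1, 2, 2, 5],
]
Z_w = standardize_rows(writer_ratings)
R_w = correlation_matrix(Z_w)
```

The resulting matrix is n by n, with unities in the principal diagonal, whatever the number of raters; it is this property that allows the writers' and the experts' matrices to be analyzed in parallel.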

According to the classical factor-analytic model, it is assumed 
that each of the matrices $Z_W$, $Z_E$ may be factorized as: 

$$
Z_W = A_W X_W + U_W \tag{8}
$$

$$
Z_E = A_E X_E + U_E \tag{9}
$$

or $Z_W$, $Z_E$ are respectively approximated by 

$$
\hat{Z}_W = A_W X_W \tag{10}
$$

$$
\hat{Z}_E = A_E X_E \tag{11}
$$

Here, $A_W$ is an $n \times M_W$ matrix, $M_W < n$, which may be called a 
factor coefficient matrix for writer ratings of objectives; $X_W$ is an 
$M_W \times N_W$ matrix, $M_W < N_W$, which is called a factor score matrix 
for writer ratings of objectives. $A_E$ and $X_E$ are $n \times M_E$ ($M_E < n$) 
and $M_E \times N_E$ ($M_E < N_E$) matrices similarly defined for experts. 
Restrictions are placed on $M_E$, $M_W$, and it should be recollected 
that $M_E$, $M_W$ are not necessarily equal. Since the analysis is done 
in parallel for any $Z_k$ it will now be convenient to discard subscripts 
on each of Z, A, X and U in the meantime. 

According to Eckart and Young (1936), a matrix $Z$ is constructed
to the desired degree of approximation in a form

$$\hat{Z} = FLX \tag{12}$$

where $F$ is an $n \times k$ orthonormal matrix such that

$$F'F = I_k \tag{13}$$

$I_k$ being an identity matrix of order $k$, and $L$ being a $k \times k$ diagonal
matrix, the diagonal elements of which are positive and arranged
in descending order of magnitude. $X$ is then a $k \times N$ orthogonal
matrix such that

$$\frac{1}{N} XX' = I_k \tag{14}$$

If we put

$$A = FL \tag{15}$$

then equation (12) becomes

$$\hat{Z} = AX \tag{16}$$


In the present model it is convenient to start from the inter-
correlation matrix

$$R = \frac{1}{N} ZZ' \tag{17}$$

The components $F$, $L$ and $X$ in equation (12) could be determined
from the latent roots and vectors of $R$. Since $R$ is a square, sym-
metric matrix with unities in the principal diagonal, it may be
analyzed directly by a principal component analysis (Hotelling, 1933;
Harman, 1960).

Suppose that $R$ is approximated by

$$\hat{R} = \frac{1}{N} \hat{Z}\hat{Z}' \tag{18}$$

then, substituting from equation (12)

$$\hat{R} = \frac{1}{N} F L X X' L' F' \tag{19}$$

which, since $(1/N)XX' = I$ and $L = L'$, reduces to

$$\hat{R} = F L^2 F' \tag{20}$$

It is of note that $L^2$ is identical to the latent roots of $\hat{R}$, and $F$ is
a set of latent vectors associated with $L^2$. This follows from (20)
since

$$\hat{R}F = F L^2 F'F = F L^2 \tag{21}$$

The product of $\hat{R}$ and of each of the column vectors of $F$ is a
vector which is proportional to that chosen column vector of $F$,
the constant of proportionality being the corresponding element of
the diagonal of $L^2$. This result means that $F$ is a set of latent vectors
of $\hat{R}$ associated with $k$ latent roots. Further, since the $k$ latent roots
of $\hat{R}$ are the first $k$ largest roots of $R$, the ratio of the sum of the
diagonal elements of $L^2$ to the sum of the diagonal elements of $R$
provides an index of how closely $R$ is approximated by $\hat{R}$.

Since $A = FL$ by equation (15), substitution in equation (20) yields

$$\hat{R} = AA' \tag{22}$$

and using equation (13)

$$A'A = L'F'FL = L^2 \tag{23}$$

Both (22) and (23) are thus illustrative of defining properties of $A$.
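
As a small worked check of equations (20) to (23), consider a hypothetical $2 \times 2$ correlation matrix with a single retained root ($k = 1$). The numbers are chosen only for illustration and do not come from the study.

```latex
$$R = \begin{pmatrix} 1 & .6 \\ .6 & 1 \end{pmatrix}, \qquad
  L^2 = 1.6, \qquad
  F = \frac{1}{\sqrt{2}} \begin{pmatrix} 1 \\ 1 \end{pmatrix}$$

$$A = FL = \sqrt{1.6} \cdot \frac{1}{\sqrt{2}} \begin{pmatrix} 1 \\ 1 \end{pmatrix}
        = \sqrt{.8} \begin{pmatrix} 1 \\ 1 \end{pmatrix}$$

$$\hat{R} = AA' = \begin{pmatrix} .8 & .8 \\ .8 & .8 \end{pmatrix}, \qquad
  A'A = .8 + .8 = 1.6 = L^2$$

$$\frac{\text{sum of diagonal elements of } L^2}{\text{sum of diagonal elements of } R}
  = \frac{1.6}{2} = .80$$
```

The latent roots of this $R$ are 1.6 and 0.4, so retaining only the larger one reproduces 80 per cent of the sum of the diagonal elements of $R$.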


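Returning to the present problem, we are concerned with sets of
data expressed in the form:

$$\hat{Z}_W = A_W X_W = F_W L_W X_W \tag{24}$$

and

$$\hat{Z}_E = A_E X_E = F_E L_E X_E \tag{25}$$

where

$$A_W = \begin{bmatrix}
a_{1 1_W} & a_{1 2_W} & \cdots & a_{1 p_W} & \cdots & a_{1 M_W} \\
a_{2 1_W} & a_{2 2_W} & \cdots & a_{2 p_W} & \cdots & a_{2 M_W} \\
\vdots & \vdots & & \vdots & & \vdots \\
a_{j 1_W} & a_{j 2_W} & \cdots & a_{j p_W} & \cdots & a_{j M_W} \\
\vdots & \vdots & & \vdots & & \vdots \\
a_{n 1_W} & a_{n 2_W} & \cdots & a_{n p_W} & \cdots & a_{n M_W}
\end{bmatrix} \tag{26}$$

$$= [a_{1_W}, a_{2_W}, \cdots, a_{p_W}, \cdots, a_{M_W}] \tag{27}$$

and

$$A_E = \begin{bmatrix}
a_{1 1_E} & a_{1 2_E} & \cdots & a_{1 q_E} & \cdots & a_{1 M_E} \\
a_{2 1_E} & a_{2 2_E} & \cdots & a_{2 q_E} & \cdots & a_{2 M_E} \\
\vdots & \vdots & & \vdots & & \vdots \\
a_{j 1_E} & a_{j 2_E} & \cdots & a_{j q_E} & \cdots & a_{j M_E} \\
\vdots & \vdots & & \vdots & & \vdots \\
a_{n 1_E} & a_{n 2_E} & \cdots & a_{n q_E} & \cdots & a_{n M_E}
\end{bmatrix} \tag{28}$$

$$= [a_{1_E}, a_{2_E}, \cdots, a_{q_E}, \cdots, a_{M_E}] \tag{29}$$

for $j = 1, 2, \ldots, n$; $p_W = 1_W, 2_W, \ldots, M_W$; $q_E = 1_E, 2_E, \ldots, M_E$.

The elements of the matrices represented in equations (27) and
(29) are column vectors resulting from the partitioning of the ma-
trices (26) and (28) respectively, for each column.

The question to be answered is: how similar are $A_W$, $A_E$? Or,
how nearly equal are $a_{p_W}$ and $a_{q_E}$?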

As a matter of fact, the coordinates in the factor space are chosen
somewhat arbitrarily, so that a transformed matrix, say

$$B_E = A_E T_E \tag{30}$$

will also produce a factor matrix, but with different coordinates.
$T_E$ is a transformation matrix for E data; similarly, there is a trans-
formation matrix $T_W$ for W data so that

$$B_W = A_W T_W \tag{31}$$

A problem is to choose factor matrices $B_E$, $B_W$ so that similarity
between two factor matrices is maximized by setting appropriate
coordinates.

First, an index to express this degree of similarity between the
two factor matrices has to be determined. When the factor matrices
are given by (26) and (28) the degree of similarity of columns $p_W$ and $q_E$
respectively could be defined by

$$h_{p_W q_E} = \frac{\sum_{j=1}^{n} a_{j p_W} a_{j q_E}}
{\sqrt{\sum_{j=1}^{n} a_{j p_W}^2 \sum_{j=1}^{n} a_{j q_E}^2}} \tag{32}$$

or, using the vector notation of (27) and (29)

$$h_{p_W q_E} = \frac{a_{p_W}' a_{q_E}}
{\sqrt{(a_{p_W}' a_{p_W})(a_{q_E}' a_{q_E})}} \tag{33}$$

If we express all the $h_{p_W q_E}$'s in matrix form, we have

$$H_{WE} = D_W^{-1/2} A_W' A_E D_E^{-1/2} \tag{34}$$

where the elements of $H_{WE}$ are $h_{p_W q_E}$, and $D_W$ and $D_E$ are diagonal
matrices whose diagonal elements are equal to the diagonal elements
of $A_W'A_W$ and $A_E'A_E$ respectively.

Clearly, if the column $p_W$ of $A_W$ and the column $q_E$ of $A_E$ are
identical then

$$h_{p_W q_E} = 1 \tag{35}$$

and if the two columns are orthogonal,

$$h_{p_W q_E} = 0 \tag{36}$$

In fact, $-1 \le h_{p_W q_E} \le +1$. Such a measure of the degree of similarity
is equivalent to Burt's (1948) "unadjusted correlation"; Tucker's
(1951) "coefficient of congruence"; and Wrigley and Neuhaus's (1955)
"degree of similarity." We shall adopt Tucker's terminology.
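
The coefficient of congruence of equations (32) to (34) can be sketched in a few lines. The function names and the two small factor matrices are hypothetical and serve only as an illustration; factor matrices are stored as lists of rows, one row per objective.

```python
import math

def congruence(a, b):
    """Coefficient of congruence between two factor columns, as in (32)."""
    num = sum(x * y for x, y in zip(a, b))
    den = math.sqrt(sum(x * x for x in a) * sum(y * y for y in b))
    return num / den

def congruence_matrix(a_w, a_e):
    """All column-by-column coefficients, the matrix H_WE of (34)."""
    cols_w = list(zip(*a_w))   # columns of A_W
    cols_e = list(zip(*a_e))   # columns of A_E
    return [[congruence(p, q) for q in cols_e] for p in cols_w]

# Hypothetical factor coefficient matrices for n = 3 objectives.
A_W = [[0.8, 0.1],
       [0.6, -0.2],
       [0.0, 0.9]]
A_E = [[0.4, 0.0],
       [0.3, 0.0],
       [0.0, 0.5]]
H_WE = congruence_matrix(A_W, A_E)
```

Note that the coefficient is +1 not only for identical columns but for any two columns that are positively proportional: the first column of `A_E` is half the first column of `A_W`, so `H_WE[0][0]` is 1.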

Now, the coefficient of congruence between the matrices $B_W$ and
$B_E$ in equations (30) and (31) can be maximized by the appropriate
choice of factor matrices, as follows. In defining a congruent space
between $B_W$ and $B_E$, the first factor of $B_W$ is matched with the
first factor of $B_E$ so that the greatest coefficient of congruence is
obtained. This process is reiterated for subsequent factors until a
significant degree of congruence can no longer be obtained. Generally,
the number of congruent factors is less than the number of
factors of either $A_W$ or $A_E$. Non-congruent factors are considered
as factors accounting for only one of the factor matrices $A_W$ or $A_E$.

Suppose we have two orthonormal matrices $F_W$ and $F_E$ such
that $F_W'F_W = I_{M_W}$ and $F_E'F_E = I_{M_E}$ (see equation (13)). As $F_W$, $F_E$
are considered to be normalized factor matrices of $A_W$, $A_E$ respec-
tively, over each column, the coefficient of congruence as defined
by equation (34) is given simply by

$$H_{WE} = F_W'F_E \tag{38}$$

If we transform $F_E$ by post-multiplying by a unit $M_E$-vector $V_{1E}$,
where $V_{1E}'V_{1E} = 1$, and $F_W$ by post-multiplying by a unit $M_W$-vector
$V_{1W}$, where $V_{1W}'V_{1W} = 1$, the coefficient of congruence between the
transformed vectors $F_E V_{1E}$ and $F_W V_{1W}$ becomes

$$h_{1W1E} = (F_W V_{1W})'(F_E V_{1E}) \tag{39}$$

since

$$(F_W V_{1W})'(F_W V_{1W}) = 1 = (F_E V_{1E})'(F_E V_{1E}) \tag{40}$$

The maximization of $h_{1W1E}$ by choice of appropriate weighting
vectors $V_{1W}$ and $V_{1E}$ under the conditions of equation (40) is the
well-known problem of canonical analysis.
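We may now define

$$J = h_{1W1E} - \tfrac{1}{2}\lambda\{(F_W V_{1W})'(F_W V_{1W}) - 1\}
      - \tfrac{1}{2}\mu\{(F_E V_{1E})'(F_E V_{1E}) - 1\}$$
$$\;\; = V_{1W}'(F_W'F_E)V_{1E} - \tfrac{1}{2}\lambda(V_{1W}'V_{1W} - 1)
      - \tfrac{1}{2}\mu(V_{1E}'V_{1E} - 1) \tag{41}$$

where $\lambda$ and $\mu$ are Lagrange multipliers. Partially differentiating
(41) with respect to $V_{1W}$ and $V_{1E}$ in turn, and setting the partial
derivatives equal to zero, we have

$$\frac{\partial J}{\partial V_{1W}} = (F_W'F_E)V_{1E} - \lambda V_{1W} = 0 \tag{42}$$
$$\frac{\partial J}{\partial V_{1E}} = (F_W'F_E)'V_{1W} - \mu V_{1E} = 0 \tag{43}$$

Premultiplying (42) by $V_{1W}'$ and noticing that $V_{1W}'V_{1W} = 1$, we have

$$(F_W V_{1W})'(F_E V_{1E}) - \lambda = 0 \tag{44}$$

Similarly, premultiplying (43) by $V_{1E}'$ and using $V_{1E}'V_{1E} = 1$,

$$(F_E V_{1E})'(F_W V_{1W}) - \mu = 0 \tag{45}$$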

Hence, from equation (40) since

$$(F_W V_{1W})'(F_E V_{1E}) = (F_E V_{1E})'(F_W V_{1W}) \tag{46}$$

$$\lambda = \mu \tag{47}$$

Multiplying equation (42) by $\lambda$,

$$\lambda(F_W'F_E)V_{1E} - \lambda^2 V_{1W} = 0 \tag{48}$$

and by using equation (47), equation (43) becomes

$$(F_W'F_E)'V_{1W} - \lambda V_{1E} = 0 \tag{49}$$

Substituting from (49) into (48) we get

$$(F_W'F_E)(F_W'F_E)'V_{1W} - \lambda^2 V_{1W} = 0 \tag{50}$$

and by putting

$$G_{WW} = (F_W'F_E)(F_W'F_E)' \tag{51}$$

equation (50) becomes

$$(G_{WW} - \lambda^2 I)V_{1W} = 0 \tag{52}$$

which is a familiar form.

Multiplying equation (43) by $\mu$, and substituting $\lambda$ for $\mu$ by (47)

$$\lambda(F_W'F_E)'V_{1W} - \lambda^2 V_{1E} = 0 \tag{53}$$

Substituting for $\lambda V_{1W}$ from (42) into (53) we get

$$(F_W'F_E)'(F_W'F_E)V_{1E} - \lambda^2 V_{1E} = 0 \tag{54}$$

and by putting

$$G_{EE} = (F_W'F_E)'(F_W'F_E) \tag{55}$$

equation (54) becomes

$$(G_{EE} - \lambda^2 I)V_{1E} = 0 \tag{56}$$

Solutions for (52) and (56) now become ordinary latent root and
vector problems, with the conditions that non-trivial solutions exist
being

$$|G_{WW} - \lambda^2 I| = 0 \tag{57}$$

and

$$|G_{EE} - \lambda^2 I| = 0 \tag{58}$$

If the rank of $G_{WW}$ is $r$, so is the rank of $G_{EE}$, and both equations
(57) and (58) will generally have the same $r$ distinct positive latent
roots. Suppose these roots are $\lambda_1^2, \lambda_2^2, \ldots, \lambda_r^2$, where
$\lambda_1^2 > \lambda_2^2 > \cdots > \lambda_r^2$. Let $V_{1W}, V_{2W}, \ldots, V_{rW}$ and
$V_{1E}, V_{2E}, \ldots, V_{rE}$ be the appropriate associated non-zero latent
vectors of $G_{WW}$, $G_{EE}$ respectively.

Suppose we take the first (largest) root $\lambda_1^2$ and the associated
vectors $V_{1W}$ from $G_{WW}$ and $V_{1E}$ from $G_{EE}$. Then the coefficient of
congruence, as defined by equation (38), between $F_W V_{1W}$ and $F_E V_{1E}$
is maximized.

$$h_{1W1E} = V_{1W}'F_W'F_E V_{1E} \tag{59}$$
$$\;\; = \lambda_1 \tag{60}$$

since $F_W'F_E V_{1E} = \lambda_1 V_{1W}$ by (42) and $V_{1W}'V_{1W} = 1$. By a similar
procedure we would obtain for the second largest root

$$h_{2W2E} = \lambda_2 \tag{61}$$

and it is maximum in the residual space as defined by
(ЕЕ) — (FwVig)FsViz) 
Furthermore, it is known that

$$V_{1W}'V_{2W} = 0 \tag{62}$$
$$V_{1E}'V_{2E} = 0 \tag{63}$$

and 
мВ ни = 0. (64) 


(see Harman, 1960; Horst, 1961). Thus, the procedures of equation
(60) can be continued until the vectors $V_{rW}$ and $V_{rE}$ associated
with $\lambda_r$ are extracted.

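For simplicity, if we define an $r \times r$ diagonal matrix

$$\Lambda = \begin{bmatrix} \lambda_1 & & & \\ & \lambda_2 & & \\ & & \ddots & \\ & & & \lambda_r \end{bmatrix} \tag{65}$$

and an $M_W \times r$ matrix made of column vectors $V_{1W}, V_{2W}, \ldots, V_{rW}$,

$$V_W = [V_{1W}, V_{2W}, \cdots, V_{rW}] \tag{66}$$

equation (52) is expressed in a form

$$G_{WW} V_W = V_W \Lambda^2 \tag{67}$$

where

$$V_W'V_W = I_r \tag{68}$$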

Similarly, we can define an $M_E \times r$ matrix of column vectors

$$V_E = [V_{1E}, V_{2E}, \cdots, V_{rE}] \tag{69}$$

whence equation (56) is expressible in the form

$$G_{EE} V_E = V_E \Lambda^2 \tag{70}$$

where

$$V_E'V_E = I_r \tag{71}$$

These $\Lambda^2$, $V_W$, and $V_E$ can be obtained by the ordinary principal com-
ponent method, for $G_{WW}$ and $G_{EE}$.

Incidentally, as seen in equations (42), (43) and (47)

$$(F_W'F_E)V_E = V_W \Lambda \tag{72}$$

and

$$(F_W'F_E)'V_W = V_E \Lambda \tag{73}$$


These equations could be used to solve for one set of vectors
when the other set of vectors is given. For instance,

$$V_W = (F_W'F_E)V_E \Lambda^{-1} \tag{74}$$

when $V_E$ is given, and

$$V_E = (F_W'F_E)'V_W \Lambda^{-1} \tag{75}$$

when $V_W$ is given. Since $V_W$ can be obtained by solving for the
latent vectors of $G_{WW}$, and $V_E$ can be independently obtained by
solving for the latent vectors of $G_{EE}$, equations (74) and (75) can be
used for checking computation.

Thus, by the transformations of $F_W$ by $V_W$, and $F_E$ by $V_E$, we
can define new factor matrices $B_W$ and $B_E$ in which the first factor
of $B_W$ is matched maximally with the first factor of $B_E$, the second
factors of each are next maximally matched, and so on, retaining
orthogonality of the $k$th factor to the $(k + 1)$th. The new matrices
$B_W$ and $B_E$ are given by

$$B_W = F_W V_W \tag{76}$$

and

$$B_E = F_E V_E \tag{77}$$

The coefficients of congruence between factors in different sets are
given by

$$H_{WE} = B_W'B_E = \Lambda \tag{78}$$

whose diagonal elements indicate the coefficients of congruence
between matched factors, and whose off-diagonal elements indicate
the coefficients of congruence between unmatched factors, sup-
posedly zero within a rounding error. As the diagonal elements
of $B_W'B_E$ are arranged in descending order of magnitude, the
last few diagonal entries might be small and negligible under some
criterion. Thus, one might accept only those coefficients of congruence
over 0.9, and regard the rest as belonging to non-congruent factors.
We note that

$$B_W'B_W = I_r \tag{79}$$

and

$$B_E'B_E = I_r \tag{80}$$

so that the $B_W$, $B_E$ may be regarded as normalized factor coefficient
matrices. These latter are convenient for the comparison of two
factorial structures with different units of measurement.

* In order to satisfy equations (42) and (43) it may be necessary to adjust
the directions of the column vectors of $V_W$ and $V_E$.


If $A_W$ and $A_E$ are already available, as in equations (30) and (31),
$B_W$ and $B_E$ can be obtained directly by using the transformation
matrices $T_W$ and $T_E$. Since

$$A_W = F_W L_W \tag{81}$$

and

$$A_E = F_E L_E \tag{82}$$

by definition (see equation (15)), then

$$F_W = A_W L_W^{-1} \tag{83}$$

and

$$F_E = A_E L_E^{-1} \tag{84}$$

Substituting (83) in (76), and (84) in (77), we get

$$B_W = A_W L_W^{-1} V_W \tag{85}$$
$$B_E = A_E L_E^{-1} V_E \tag{86}$$

Hence,

$$T_W = L_W^{-1} V_W \tag{87}$$
$$T_E = L_E^{-1} V_E \tag{88}$$


Summary of Computational Procedures


1. Obtain the intercorrelation matrices $R_W$ and $R_E$.

2. By the principal component method, find a diagonal matrix $L_W^2$
consisting of the first $M_W$ significant latent roots of $R_W$ and their
associated vectors $F_W$, where $F_W'F_W = I_{M_W}$. Similarly, find the
first $M_E$ significant latent roots matrix $L_E^2$ and the associated unit
vectors $F_E$ of $R_E$, $F_E'F_E = I_{M_E}$.

3. If desired, compute $A_W = F_W L_W$, $A_E = F_E L_E$, which yield
principal components of $R_W$, $R_E$, respectively, in Hotelling's sense.

4. Determine the positive latent roots matrix $\Lambda^2$ and the unit
vectors $V_W$ of $G_{WW} = (F_W'F_E)(F_W'F_E)'$, by the principal component
method.

5. Obtain $V_E = (F_W'F_E)'V_W \Lambda^{-1}$.

6. Similarly, obtain the latent roots, $\Lambda^2$, and vectors, $V_E$, of $G_{EE}$,
reflecting the column vectors if necessary. Check that the $\Lambda^2$, $V_E$
are equal to those obtained in steps 4 and 5.

7. Obtain $V_W = (F_W'F_E)V_E \Lambda^{-1}$, and check that it is the same
as that found in step 4.

8. Compute $B_E = F_E V_E$, $B_W = F_W V_W$. Both $B_W$ and $B_E$ give
the most nearly congruent factor matrices for corresponding columns.

9. Compute $H_{WE} = B_W'B_E$ whose elements are the coefficients
of congruence between two sets of factors. Check that $B_W'B_E$ is
equal to $\Lambda$, the positive square root of the latent roots matrix
$\Lambda^2$ of step 4 (or 6), as required by equation (78).
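
Steps 4, 5, 8 and 9 above can be sketched for the special case of two factors in each set. This is a minimal illustration under stated assumptions, not the author's program: the helper names and the two orthonormal matrices are hypothetical, both latent roots are assumed positive, and the closed-form rotation in `sym2_eig` stands in for a general principal component routine and works only for 2 x 2 symmetric matrices.

```python
import math

def transpose(m):
    return [list(col) for col in zip(*m)]

def matmul(a, b):
    bt = transpose(b)
    return [[sum(x * y for x, y in zip(row, col)) for col in bt] for row in a]

def sym2_eig(g):
    """Latent roots (largest first) and unit latent vectors (as columns)
    of a symmetric 2 x 2 matrix, via a single plane rotation."""
    a, b, d = g[0][0], g[0][1], g[1][1]
    theta = 0.5 * math.atan2(2.0 * b, a - d)
    c, s = math.cos(theta), math.sin(theta)
    root1 = a * c * c + 2.0 * b * c * s + d * s * s
    root2 = a * s * s - 2.0 * b * c * s + d * c * c
    return [root1, root2], [[c, -s], [s, c]]

def match_factors(f_w, f_e):
    """Return (lambda values, H_WE) for two-column orthonormal F_W, F_E."""
    c = matmul(transpose(f_w), f_e)               # F_W'F_E
    g_ww = matmul(c, transpose(c))                # step 4: G_WW
    roots, v_w = sym2_eig(g_ww)
    lam = [math.sqrt(max(r, 0.0)) for r in roots]
    v_e = matmul(transpose(c), v_w)               # step 5: (F_W'F_E)'V_W ...
    v_e = [[v_e[i][k] / lam[k] for k in range(2)] for i in range(2)]
    b_w = matmul(f_w, v_w)                        # step 8
    b_e = matmul(f_e, v_e)
    h_we = matmul(transpose(b_w), b_e)            # step 9
    return lam, h_we

# Hypothetical orthonormal factor matrices for n = 3 objectives.
F_W = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
F_E = [[0.6, -0.48], [0.8, 0.36], [0.0, 0.8]]
lam, H_WE = match_factors(F_W, F_E)
```

For these inputs the latent roots of $G_{WW}$ are 1 and .36, so the matched coefficients of congruence are 1 and .6, and the off-diagonal elements of `H_WE` vanish apart from rounding error, which is the check described in step 9.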


Numerical Example 


The author will be pleased to furnish, free of charge, a numerical 
example illustrating the above procedures. Requests should be sent 
care of the Department of Educational Psychology, Rutgers—The 
State University, New Brunswick, New Jersey 08901.


INTERRELATIONSHIPS AMONG PERSONALITY SCALE
PARAMETERS: ITEM RESPONSE STABILITY AND
SCALE RELIABILITY¹


RICHARD R. JONES 
Oregon Research Institute 
AND 
LEWIS R. GOLDBERG 
University of Oregon and Oregon Research Institute 


For the assessment of individual differences in hypothetically
stable personality traits, it is desirable to develop scales which will 
maximize both scale score variance and score retest stability. Since 
scales are composed of individual items, it is reasonable to expect 
that characteristics of scales (e.g., score variance and score sta- 
bility) should be related to characteristics of their items. The 
cogency of this statement for scale construction is seen in the stand- 
ard practice of eliminating nonvalid items from scales to increase 
scale validity. In a similar manner, score retest stability should be 
related to the stability of responses to the scale’s items. 

The objective of developing scales to maximize both score vari- 
ance and score stability suggests two criteria for the selection of 
items: (a) good items should elicit responses which maximize indi- 
vidual differences, and (b) good items should elicit responses which 
are stable over time. In general, items with large variances and high
retest stability should tend to maximize score variance and score 
stability respectively. The simultaneous satisfaction of these two 


1 This report is adapted from a doctoral dissertation carried out by the first 
author under the direction of the second author. The study was supported by 
Grant #G-25123 (GS-429) from the National Science Foundation to Lewis R. 
Goldberg at Oregon Research Institute. Data analyses were carried out at the 
Western Data Processing Center and the Health Science Computing Facility 
at the University of California at Los Angeles.

item selection criteria, however, highlights a psychometric paradox
(e.g., Goldberg, 1963a), namely that responses to high variance
items (i.e., moderate endorsement items) are typically unstable,
while responses to low variance items (i.e., items with extreme en- 
dorsement percentages) tend to be stable from occasion to occa- 
sion. This paradoxical finding from itemmetric research seems to 
imply that between-individual score variance may have to be sac- 
rificed to obtain within-individual score stability, or vice-versa— 
unless it can be shown that the parallel between item variance and 
item stability on the one hand, and score variance and score stabil- 
ity on the other, is not as straightforward as common sense would 
suggest. The present study was designed to investigate empirically 
the relationships between scale parameters (such as scale stability, 
homogeneity, and variance) and various measures of average item 
stability. 


Procedure 


Ninety-five male and 108 female undergraduates in a General 
Psychology course at the University of Oregon were administered 
the California Psychological Inventory (CPI) in class, followed 
two weeks later by the Minnesota Multiphasic Personality Inven- 
tory (MMPI) ; four weeks after the first administration, each in- 
ventory was again administered in class to the same students 
(Goldberg, 1963b; Goldberg and Rorer, 1963; Goldberg and Rorer, 
1964). Scale parameters² for each of 199 MMPI and CPI scales³
were computed separately for the male and for the female samples. 
The scale parameters included: (a) test-retest score stability (the 
product-moment correlation between scores across the two admin- 
istrations), (b) scale homogeneity (Kuder-Richardson Formula 20) 
for the first administration, (c) score variance for the first
administration, (d) the number of items in the scale, and (e-i) the
mean value of five item properties over the set of items in the
scale. The five mean item properties were: (e) mean item variance,
(f) mean Stable (the percentage of subjects responding consistently
to the item on retest), (g) mean Phi, (h) mean Phi/Phimax, and
(i) mean Ambdex (Goldberg's [1963a] index of item ambiguity, which
essentially corrects Stable for item endorsement extremeness).


² For purposes of this study, a scale parameter is broadly defined as any
measure based on subjects' responses to a set of items comprising a scale. For
a discussion of the item properties from which some of the scale parameters
are derived, see Goldberg and Rorer (1963) and Wiggins and Goldberg
(1965).

for each set of nonoverlapping scales, with no change in the results.
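
Two of the scale parameters listed above, scale homogeneity (Kuder-Richardson Formula 20) and score variance, can be sketched as follows. The function name and the tiny response matrix are hypothetical, and the variance is computed with the divisor N (an assumption; the text does not say which divisor was used).

```python
def kr20(responses):
    """Kuder-Richardson Formula 20 for 0/1 item responses.

    responses: list of persons, each a list of 0/1 item scores.
    Uses the population variance (divisor N) of the total scores.
    """
    n = len(responses)
    m = len(responses[0])
    p = [sum(row[j] for row in responses) / n for j in range(m)]
    sum_pq = sum(pj * (1.0 - pj) for pj in p)      # sum of item variances
    totals = [sum(row) for row in responses]
    mean_total = sum(totals) / n
    score_variance = sum((t - mean_total) ** 2 for t in totals) / n
    return (m / (m - 1)) * (1.0 - sum_pq / score_variance)

# Hypothetical data: 4 persons answering 3 true-false items.
responses = [[1, 1, 1],
             [1, 1, 0],
             [1, 0, 0],
             [0, 0, 0]]
homogeneity = kr20(responses)
```

Here the item variances sum to .625 and the score variance is 1.25, so the coefficient is (3/2)(1 - .625/1.25) = .75.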

The intercorrelations among these scale parameters, over the 199 
scales, were computed separately for the sample of 95 males and 
for the sample of 108 females. For ease of presentation, the signs of 
the correlations between mean Ambdex and the other scale param- 
eters were reversed, so that high mean Ambdex will imply high item 
response stability. Thus, mean Stable, mean Phi, mean Phi/Phimax 
and mean Ambdex, as measures of item response stability should 
intercorrelate positively. 

For all pairs of variables, four indices of association were com- 
puted: (a) Pearson product-moment coefficients for continuous 
data; (b) Pearson product-moment coefficients for grouped data 
(the values of each variable having been grouped into stens); (c)
and (d) two correlation ratios (etas) for rows and columns, respec- 
tively. Computation of the latter three statistics permitted two F- 
tests for curvilinearity (row-eta vs. grouped r, and column-eta vs. 
grouped r), each of which was tested for significance. Etas pre- 
sented in this paper are always significantly greater (p < .01) than 
their corresponding grouped r. Correlations (r) listed are always 
based on ungrouped data. 


Results and Discussion 


The results of this study will be presented in two sections. First, 
the relationships among the scale parameters will be presented, and 
second, the implications of the psychometric paradox for score re- 
liability will be discussed. 


Intercorrelations among Scale Parameters 

Table 1 presents the means, standard deviations, and intercorre- 
lations among the nine scale parameters for the two samples. While 
the signs of many of these correlations are predictable on the basis
of algebraic interdependence between variables, test theoretic pre-
dictions, or simple logic, an empirical estimation of the magnitude
of these relationships is of considerable interest. 

The correlations between the two types of scale reliability esti- 
mates, scale stability and scale homogeneity are positive (r = .74 
and .79) as would be expected, although these coefficients are not 
sufficiently large to permit the conclusion that these two reliability 
estimates are interchangeable measures of scale reliability. As Cat- 
tell (1964) among others has pointed out, a scale’s internal con- 
sistency may differ considerably from its temporal stability, either 
due to design in test construction, or due to the character of the trait 
which the scale measures. 

A second set of expected relationships are those between the num- 
ber of items in the scale and each of the two reliability estimates. 
Perhaps the best known conclusion from classical test theory (e.g., 
Gulliksen, 1950) is that scale reliability may be increased by 
lengthening the scale, and the Spearman-Brown formula provides 
a means of estimating the resultant reliability for increases in scale 
length. However, it does not follow from this test theoretic formula- 
tion that, for a set of different scales, long scales necessarily will be 
more reliable than short scales. In the case of a single scale, the 
underlying rationale for expecting increased reliability by length- 
ening the scale is that the augmented scale measures a greater pro- 
portion of true score variance than the shorter form of the same 
scale. For the case of many different scales, as in the present study, 
longer scales will show higher reliability than shorter scales only 
insofar as the longer scales measure a greater proportion of true 
score variance than the shorter scales. Since a relatively short scale 
may measure a greater proportion of true variance than a longer 
scale, an empirical measure of the relationship between scale length 
and scale reliability may be less than perfect. In fact, the present 
data show that these correlations are considerably less than unity; 
the correlations between number of items and scale stability were
.41 and .50, while the correlations between number of items and
scale homogeneity were .49 and .50.
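
The Spearman-Brown formula mentioned above can be written as a one-line function. The function name and the example values are hypothetical. The formula applies to lengthening a single scale with comparable items, which is the restriction the paragraph draws attention to; it says nothing about comparisons across different scales.

```python
def spearman_brown(reliability, length_factor):
    """Estimated reliability after lengthening a scale.

    reliability: reliability of the original scale.
    length_factor: ratio of new length to original length (k).
    """
    k = length_factor
    return k * reliability / (1.0 + (k - 1.0) * reliability)

# Doubling a scale whose reliability is .50.
doubled = spearman_brown(0.50, 2.0)
```

Doubling a scale with reliability .50 gives an estimate of 1.0/1.5, or about .67, while a length factor of 1 leaves the reliability unchanged.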

A third set of expected relationships are those between the two 
types of scale reliability on the one hand and score variance and 
mean item variance on the other. Guilford (1954) concluded that 
for scales with less than perfect intercorrelations among items, a
concentration of items with high variances will tend to enhance
scale reliability. For the scales in the present study, the correlations
between mean item variance and scale stability were .46 and .24,
while the corresponding correlations with scale homogeneity were
.23 and .11. Although these coefficients were in the expected direc- 
tion, their magnitude suggests that mean item variance is only mod- 
erately predictive of either type of scale reliability. Score variance, 
on the other hand, was more strongly associated with scale reliabil- 
ity, with correlations of .57 and .58 between score variance and scale 
stability, and .67 and .66 between score variance and scale homo- 
geneity. These coefficients, although artifactually inflated due to 
algebraic interdependence between the variables, illustrate the well- 
known conclusion that large score variance is, in general, a neces- 
sary but not sufficient condition for high score reliability. 

A fourth set of relationships involves those among the number of 
items in the scale, score variance, and mean item variance. Number 
of items and score variance were positively related (r = .72 and
.75), illustrating the constraint placed on score dispersions by the
number of items on which the scores are based. The correlations be-
tween score variance and mean item variance should be positive,
but low, due to the greater dependence of score variance on a
scale's inter-item covariances than item variances. Score variance
is simply the sum of the elements in a scale's inter-item variance-
covariance matrix which contains m (number of items) variance
terms and m(m - 1) covariance terms. The relatively small contri-
bution to score variance from item variances compared to covari-
ances is empirically demonstrated by the low correlation (r = .34
and .31) between score variance and mean item variance. Finally,
as might be expected, the number of items in a scale and its mean 
item variance were unrelated (r = .08 in both samples). 
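
The statement that score variance is the sum of all elements of the inter-item variance-covariance matrix can be checked numerically. The sketch below uses a hypothetical response matrix and hypothetical names, with the divisor N throughout.

```python
def covariance_matrix(responses):
    """Population variance-covariance matrix of the item columns."""
    n = len(responses)
    m = len(responses[0])
    means = [sum(row[j] for row in responses) / n for j in range(m)]
    return [[sum((row[j] - means[j]) * (row[k] - means[k])
                 for row in responses) / n
             for k in range(m)]
            for j in range(m)]

# Hypothetical data: 4 persons answering 3 true-false items.
responses = [[1, 1, 1],
             [1, 1, 0],
             [1, 0, 0],
             [0, 0, 0]]
m = len(responses[0])
cov = covariance_matrix(responses)

totals = [sum(row) for row in responses]
mean_total = sum(totals) / len(totals)
score_variance = sum((t - mean_total) ** 2 for t in totals) / len(totals)

sum_of_elements = sum(sum(row) for row in cov)
variance_part = sum(cov[j][j] for j in range(m))     # m variance terms
covariance_part = sum_of_elements - variance_part    # m(m - 1) covariance terms
```

With three items there are 3 variance terms and 6 covariance terms. In this example each group contributes .625 to a score variance of 1.25; with more items the covariance terms quickly outnumber the variance terms.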

Turning now to the relationships among the four mean item sta-
bility measures, some unexpected findings emerge. If all of these
variables are similar measures of response stability, their intercor-
relations should be positive. However, the correlations between
mean Stable and mean Phi were negative (r = -.37 and -.47), a
function of the fact that Phi is attenuated for extreme endorsement
[...] numerical value of Phi is attenuated by item extremeness while
simultaneously being increased by increases in response stability, the
negative correlations between mean Stable and mean Phi suggest
that the former constraint is more powerful than the latter. When
mean item variance is partialled out, the partial correlations be-
tween mean Stable and mean Phi rise to .75 and .50.

The linear correlations between mean Stable and mean Ambdex
(reflected) are negligible in this study, though non-linearity (eta =
.87 and .42) is evident in these relationships. These relationships are
similar to those obtained in itemmetric studies (e.g., Goldberg,
1965; Wiggins and Goldberg, 1965) when many extreme items are
included in the item pool. When extreme items are progressively
eliminated, however, the relationship between Stable and Ambdex 
increases dramatically. In the present study, when mean item vari- 
ance is partialled out, the partial correlations between mean Stable 
and mean Ambdex (reflected) were .85 and .48. 

Perhaps the most significant set of relationships illustrated in 
Table 1 are those between the four mean item stability measures 
and the five other scale parameters. Here the focus of interest is on 
the extent to which scale reliability and score variance can be pre- 
dicted from the properties of the average item in the scale.*As Table 
1 indicates, scale retest stability was positively associated with 
mean Phi (r = .45 and .31), mean Phi/Phimax (r = .23 and .19), 
and mean Ambdex (reflected) (r = .32 and .31). These correlations 
are in the expected direction, indicating that the stability of the 
average item in a scale is correlated with the stability of the scale. 

Unexpected, however, are the negative correlations between scale
stability and mean Stable (r = -.35 and -.10)! This unusual
finding is explained by the high negative correlations between mean
Stable and mean item variance (r = -.92 in both samples), and
the moderate positive correlations between mean item variance and
scale stability (r = .46 and .24). Mean Stable primarily reflects
mean item variance rather than mean item stability, and conse-
quently it is negatively correlated with scale stability. If one were
to use Stable as an item selection criterion, the result would be to
lower the over-all stability of the final scale (the psychometric
paradox)! The resolution of this paradox will be discussed later.

The last findings of interest in Table 1 are the generally low cor- 
relations between the four mean item stability measures and scale 
homogeneity. The few significant correlations that do occur reflect 
primarily the correlation of each of the four mean stability meas- 
ures with mean item variance, which is moderately correlated with 
scale homogeneity. 
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concentration of items with high variances will tend to enhance 
scale reliability. For the seales in the present study, the correlations 
between mean item variance and scale stability were .46 and 24, 
while the corresponding correlations with scale homogeneity were 
23 and .11. Although these coefficients were in the expected direc- 
tion, their magnitude suggests that mean item variance is only mod- 
erately predictive of either type of scale reliability. Score variance, 
on the other hand, was more strongly associated with scale reliabil- 
ity, with correlations of .57 and .58 between score variance and scale 
stability, and .67 and .66 between score variance and scale homo- 
geneity. These coefficients, although artifactually inflated due to 
algebraic interdependence between the variables, illustrate the well- 
known conclusion that large score variance is, in general, a neces- 
sary but not sufficient condition for high score reliability. 

A fourth set of relationships involves those among the number of 
items in the scale, score variance, and mean item variance. Number 
of items and score variance were positively related (r — .72 and 
:75), illustrating the constraint placed on score dispersions by the 
number of items on which the scores are based. The correlations be- 
tween score variance and mean item variance should be positive, 
but low, due to the greater dependence of score variance on a 
scale's inter-item covariances than item variances. Score variance 
is simply the sum of the elements in a scale’s inter-item variance- 
covariance matrix which contains m (number of items) variance 
terms and т (m — 1) covariance terms. The relatively small contri- 
bution to score variance from item variances compared to covari- 


variables are similar measures of response stability, their intercor- 
relations should be positive. However, the correlations between 
mean Stable and mean Phi were negative (т = —.37 and —.47), a 
function of the fact that Phi is attenuated for extreme endorsement 


merical value of Phi is attenuated by item extremeness while simul- 
taneously being increased by increases in Tesponse stability, the 
negative correlations between mean Stable and mean Phi suggest 
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` "that the former constraint is more powerful than the latter. When 
mean item variance is partialled out, the partial correlations be- 
tween mean Stable and mean Phi rise to .75 and .50. 
The linéar correlations between mean Stable and mean Ambdex 
"(reflected) are negligible in this study, though non-linearity (eta = 
87 and .42) is evident in these relationships. These relationships are 
similar to those obtained in itemmetric studies (e.g., Goldberg, 
1965; Wiggins and Goldberg, 1965) when many extreme items are 
included in the item pool. When extreme items are progressively 
eliminated, however, the relationship between Stable and Ambdex 
increases dramatically. In the present study, when mean item vari- 
ance is partialled out, the partial correlations between mean Stable 
and mean Ambdex (reflected) were .85 and .48. 

Perhaps the most significant set of relationships illustrated in 
Table 1 are those between the four mean item stability measures 
and the five other scale parameters. Here the focus of interest is on 
the extent to which scale reliability and score variance can be pre- 
dicted from the properties of the average item in the scale.*As Table 
1 indicates, scale retest stability was positively associated with 
mean Phi (r = .45 and .31), mean Phi/Phimax (г = .23 and .19), 
and mean Ambdex (reflected) (r = .32 and .31). These correlations 
are in the expected direction, indicating that the stability of the 
average item in a scale is correlated with the stability of the scale. 

Unexpected, however, are the negative correlations between scale 
stability and mean Stable (r = —.35 and —.10)! This unusual 
finding is explained by the high negative correlations between mean 
Stable and mean item variance (r — —.92 in both samples), and 
the moderate positive correlations between mean item variance and 
scale stability (r = .46 and .24). Mean Stable primarily reflects 
mean item variance rather than mean item stability, and conse- 
quently it is negatively correlated with scale stability. If one were 
to use Stable as an item selection criterion, the result would be to 
lower the over-all stability of the final scale (the psychometric 
paradox) ! The resolution of this paradox will be discussed later. 

The last findings of interest in Table 1 are the generally low cor- 
relations between the four mean item stability measures and scale 
homogeneity. The few significant correlations that do occur reflect 
primarily the correlation of each of the four mean stability meas- 
ures with mean item variance, which is moderately correlated with 
scale homogeneity.


Partial Correlations


The zero-order correlations presented in Table 1 suggest that 
mean item variance functions as a suppressor variable in the rela- 
tionships between scale stability and the various measures of mean 
item stability. In this section, the interrelationships among mean 
item stability, mean item and scale variance, and scale stability 
will be evaluated more completely. Table 2 presents the zero-order 
correlations among these variables, and then presents partial cor- 
relations to evaluate the influence of mean item variance and score 
variance on these relationships. The top three rows of Table 2 show 
the zero-order correlations of scale stability, mean item variance, and 
score variance with mean Stable, mean Phi, and mean Ambdex
(from Table 1). The lower three rows of Table 2 show the effect on 
these correlations when mean item variance and score variance are 
partialled out, first separately, then jointly. 


TABLE 2
Partial Correlations among Scale Parameters

                                     Mean Stable       Mean Phi       Mean Ambdex
                                      M       F       M       F       M       F

Zero-order Correlations
  With scale stability              −.35**  −.10     .45**   .31**   .32**   .31**
  With mean item variance           −.92**  −.92**   .66**   .68**   .27**   ,B4**
  With score variance               −.897** −.32**   .12     .21**  −.06     187%
First-order partials
  Scale stability, with mean
    item variance partialled out     .19*    .28**   .22**   019%    .23**   .22**
  Scale stability, with score
    variance partialled out         −.18*    л       .47**   2489    .43**   22588
Second-order partials
  Scale stability, with mean item
    variance and score variance
    partialled out                   .33**   .41**   .34**   .25**   .36**   .25**


* p < .05.
** p < .01.


As indicated earlier, the zero-order correlations shown in Table 2
indicate that mean Phi and mean Ambdex are, in general, positively
associated with both scale stability and mean item variance, but
that mean Stable is negatively related to the same variables. These
findings suggest that the psychometric paradox (i.e., responses to
high variance items tend to be unstable) holds when mean Stable is
used as a measure of average item response stability, but not when
mean Phi or mean Ambdex is used.

Turning now to the first-order partial correlations with mean 
item variance partialled out, the correlations between mean Stable 
and scale stability are .19 and .28, similar to the partial correlations 
between scale stability and either mean Phi or mean Ambdex. While 
the relationships between scale stability and mean Stable, mean 
Phi, and mean Ambdex are all influenced by mean item variance, 
the suppression effect is greatest for mean Stable due to the high 
negative correlation between mean item variance and mean Stable 
(r = −.92 in both samples). Note that mean Stable, with mean
item variance partialled out, is now predictive of scale stability in 
the expected direction (positive), and that the first-order partial 
correlations between scale stability and both mean Phi and mean 
Ambdex, although still in the expected direction, are lower than 
their corresponding zero-order correlations. With mean item variance
partialled out, all three mean item stability measures correlate
approximately equally with scale stability. 
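
The first-order partials discussed above follow the standard partial-correlation formula. The sketch below is only an illustration: the function and variable names are ours, and the inputs are the rounded male-sample zero-order correlations quoted in the text. It shows how holding mean item variance constant turns the negative zero-order correlation for mean Stable into a positive one.

```python
import math


def partial_corr(r_xy, r_xz, r_yz):
    """First-order partial correlation of x and y with z held constant."""
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))


# Rounded male-sample values quoted in the text (two decimals only).
r_stability_stable = -0.35    # scale stability with mean Stable
r_stability_itemvar = 0.46    # scale stability with mean item variance
r_stable_itemvar = -0.92      # mean Stable with mean item variance

partial = partial_corr(r_stability_stable, r_stability_itemvar, r_stable_itemvar)
print(round(partial, 2))
```

Because the inputs are rounded to two decimals, the result (about .21) differs slightly from the tabled value of .19, but the change of sign is reproduced.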

When one examines the first-order partial correlations with score 
variance partialled out, one can see that the influence of score vari- 
ance is noticeably less than that for mean item variance. In fact, 
for the male sample, the first-order partial correlations between 
scale stability and both mean Phi and mean Ambdex increase 
slightly over their zero-order correlations. These slight changes stem
from the lack of correlation between score variance on the one hand 
and both mean Phi and mean Ambdex on the other. 

The bottom row in Table 2 shows the second-order partial cor- 
relations between each of the three mean item stability measures 
and scale stability, with both mean item variance and score vari- 
ance partialled out. The differences among these correlations are 
small, leading to the conclusion that mean Stable, mean Phi, and 
mean Ambdex are equally predictive of scale stability when the in- 
fluences of mean item variance and score variance are partialled 
out. What is even more interesting is the fact that these coefficients 
are not larger, since they reflect the amount of variance in scale 
stability which is predictable from knowledge of item stability. 
Since numerous patterns of item responses can produce the same
score on the scale, a given subject could change responses to several
items on retest without changing his score. Consequently, any correlation
between measures of item response change and score change
would be less than unity; indeed, if the present data are at all representative,
that correlation is only about .30!


Conclusions 


This study investigated the implications of a psychometric paradox
for considerations of scale score reliability. The paradox, a
finding from itemmetric studies, is that items which show high response
stability over time tend to be low variance items. Since both
within-individual score stability and across-individual score varia- 
bility are desirable scale characteristics, the psychometric paradox 
could present a problem, if maximizing score reliability is an im- 
portant goal of scale construction. The following conclusions appear 
to be supported by the data. 

1. The psychometric paradox appears to hold true when mean 
Stable is the item stability measure employed. That is, high mean
Stable scales tend to show low mean item variance and low scale 
variance (and consequently tend to have low scale stability). 

2. However, with mean Phi or mean Ambdex as a measure of 
item stability, positive correlations with mean item variance are 
obtained—suggesting that the psychometric paradox is dependent 
on the statistic used to measure item stability. 

3. Because of the relative independence of score variance with
either mean Phi or mean Ambdex, it should be possible to obtain 
simultaneously large score variability and high item stability in 
future personality scales.


PROCEDURES AND CRITERIA FOR EVALUATING
READING AND LISTENING COMPREHENSION TESTS


EDMOND MARKS AND GARY A. NOLL
The Pennsylvania State University


Reading and listening comprehension tests are two examples of a
general class of test-taking situations in which the test taker is ex- 
posed to a more or less clearly specified type and amount of infor- 
mation and is then asked to respond to questions based on the ma- 
terial. In this kind of task the tester asks the subject to recognize a 
rule or set of rules which relates the elements of the information
provided. 

These tests are characterized by (a) an initial stimulus input in 
the form of some aurally or visually presented material, e.g., a
paragraph; (b) a subsequent stimulus input, again in the form of 
some aurally or visually presented material and related to the ini- 
tial input, e.g., a set of questions pertaining to the paragraph; and 
(c) a response or set of responses tied to these two stimulus inputs. 
From the subject’s response(s) the tester infers whether the subject 
has correctly formulated the rule. 

The response (or responses) although tied to the two stimulus in- 
puts, is assumed to be primarily a function of some intervening 
process which the scores of the test are assumed to reflect. This in- 
tervening process is what has been called “reading comprehension” 
or “listening comprehension.” 

It is evident that this kind of task is different structurally from 
tasks relating to achievement where an understanding of the 
rule(s) is assumed and the subject is to deal appropriately with the 
elements (information) (Guttman, 1965). This difference in testing
procedures gives rise to a question concerning the appropriate
methods for evaluating this particular type of test. An underlying
assumption of this paper is that the standard psychometric proce- 
dures—e.g., the estimation of discrimination and difficulty indexes 
and other aspects of item analysis—are not fully appropriate for 
evaluating the typical reading or listening comprehension test. 

The objective of this paper is to develop additional criteria and 
procedures for evaluating measures of the reading or listening com- 
prehension task. Some unresolved questions concerning the nature 
of this ability called “comprehension,” or the effects of methodol- 
ogy—e.g., varying the interval between stimulus inputs or varying 
the mode of presentation—will not be addressed. The lack of com- 
plete or even adequate conceptualization of reading and listening 
comprehension does not, however, preclude the usefulness of this 
construct for certain purposes—e.g., the prediction of academic 
achievement—for which criteria and procedures for evaluating 
reading or listening tasks should be available. 


Formal Development of the Comprehension Task 


Some rather strong assumptions underlie the development and 
use of any comprehension task: 

(1) Response(s) elicited on the comprehension task are inde- 
pendent of specific previous knowledge, either partial or 
complete, or nonknowledge response biases. 

This assumption insures that responses are tied solely to the stimu- 
lus input and certain characteristics of the subject, mainly the abil- 
ity to integrate, abstract, and in other ways manipulate the stimu- 
lus inputs to attain the "correct" response. 

Following from the first assumption we have, 

(2) Any increase in the proportion of a given response $r_i$ over
chance is attributable to the initial and subsequent stimulus
inputs.

Coupled with the first, this assumption maintains that the observed
response pattern is determined by the material presented and the
questions asked concerning this material, and is independent of any
previous knowledge the subjects have relevant to that material.

Our intuitive notion of reading and listening comprehension is 
that it involves something different from mere storage and retrieval 
of data, that it involves the more complex handling, including the
transformation and extrapolation of the present data. The strictures
of this approach require that responses attributable to factors other 
than the given inputs must be considered in error. 

The second assumption also implies changes in the proportions of 
responses as a function of the stimulus input. Any given input for 
which the proportion assigned to each response remains the same 
both before and after its presentation cannot be said to contain 
relevant information. 

Structuring the reading and listening comprehension tasks as we 
have—i.e., as the eliciting of one of a finite number of specified re- 
sponse alternatives as a function of a sequence of stimulus inputs 
and some intervening response process which is independent of pre- 
vious knowledge or any systematic response biases—allows us to 
formulate the following idealized situation. 

For a single trial of a comprehension task, e.g., the visual presentation
of a single paragraph (initial input) to a single subject
along with one multiple choice question (k specified alternatives)
relevant to that paragraph (subsequent input), three possible outcomes
($A_i$; $i = 1, 2, 3$) can be defined.

$A_1$: The subject knows (or consistently selects) the “correct”
answer independently of the initial stimulus input, i.e.,
the subject will select the correct alternative without
having read or listened to the given passage.

$A_2$: The subject learns the “correct” alternative as a function
of having been exposed to the initial stimulus input.

$A_3$: The subject neither knows the “correct” alternative nor
learns it from exposure.

It is easily determined that these outcomes are mutually exclusive
and exhaustive, i.e.,

\[ A_1 \cap A_2 = A_1 \cap A_3 = A_2 \cap A_3 = \emptyset \]

and

\[ A_1 \cup A_2 \cup A_3 = S, \]

where $S$ is the set of all possible outcomes. Similarly, $P(A_1) +
P(A_2) + P(A_3) = P(S) = 1$, where $P(A_i)$ is the probability of
outcome $A_i$.

Let n independent trials of this task be made, e.g., n subjects
responding to one paragraph and one question regarding that paragraph,
and let $(x_1, x_2, x_3)$ be a 2-dimensional variable where the
component $x_i$ represents the number of subjects in $A_i$. This random
variable is 2-dimensional since $\sum_i x_i = n$, and $x_3$ is a linear
function of the remaining $x_i$. Furthermore, $p_3 = 1 - p_1 - p_2$ where
$p_i$ is the probability of outcome $A_i$.

Our problem is the estimation of the parameters $p_1$, $p_2$, and $p_3$,
the sampling distribution of these estimators, and the establishment
of a confidence region for the parameters for any probability value.

The probability function of this random variable is given by

\[ f(x_1, x_2; p_1, p_2) = \frac{n!}{x_1!\, x_2!\, x_3!}\, p_1^{x_1} p_2^{x_2} p_3^{x_3}, \]

which is the familiar multinomial distribution.


Maximum-Likelihood Estimators for the Parameters 
and Their Sampling Distribution 
Let a sample of size n be drawn with $n_i$ the number of sample
elements in $A_i$ such that $\sum_{i=1}^{3} n_i = n$. The likelihood of the sample is

\[ L(p_1, p_2) = \prod_{i=1}^{3} p_i^{n_i}. \tag{1} \]

The logarithm of (1) is

\[ \log L(p_1, p_2) = n_1 \log p_1 + n_2 \log p_2 + n_3 \log p_3 = \sum_{i=1}^{3} n_i \log p_i. \]

Putting the first derivatives of $\log L(p_1, p_2)$ equal to 0 and solving
for the parameters we obtain

\[ \hat{p}_i = \frac{n_i}{n}. \tag{2} \]


The probability that a given subject drawn at random belongs to 
any of the three categories previously defined is given by the esti- 
mators in (2). 
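
As a brief illustration of the estimators in (2), the following Python sketch computes the sample proportions from category counts. The function name is hypothetical, and the counts are the category counts reported for item 28 in the example later in the paper.

```python
def mle_proportions(counts):
    """Maximum-likelihood estimates p_i = n_i / n for multinomial counts."""
    n = sum(counts)
    if n <= 0:
        raise ValueError("counts must sum to a positive number")
    return [c / n for c in counts]


# Category counts (n_A1, n_A2, n_A3) reported for item 28 of the example.
p_hat = mle_proportions([21, 114, 163])
print([round(p, 2) for p in p_hat])  # [0.07, 0.38, 0.55]
```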

To determine the variances and covariances of the estimators
in (2) we make use of a theorem for large samples which states that
the maximum likelihood estimators ($\hat{\theta}_i$) are approximately distributed
as the multivariate normal distribution with means $\theta_i$ and variance-covariance
matrix given by $(1/n)\Sigma$. The elements of the inverse
of $\Sigma$ are given by

\[ \sigma^{ij} = -E\left[ \frac{\partial^2 \log f(x; \theta_1, \theta_2, \ldots, \theta_k)}{\partial \theta_i \, \partial \theta_j} \right], \tag{3} \]

where $E$ is the expected value operator (Mood and Graybill, 1963).
For the given problem, differentiating $\log f(x; p_1, p_2)$ and taking
expected values, we obtain

\[ \sigma^{ij} = \frac{\delta_{ij}}{p_i} + \frac{1}{p_3}, \tag{4} \]

where $\delta_{ij}$ is the Kronecker $\delta$ which has the value 1 if $i = j$, and 0 if
$i \neq j$. The value of the determinant $|\Sigma^{-1}|$ is $1/\prod_{i=1}^{3} p_i$, which
combined with (3) yields the approximate large-sample distribution
of the estimators, given as

\[ f(\hat{p}_1, \hat{p}_2) = \frac{n}{2\pi} \left( \frac{1}{\prod_{i=1}^{3} p_i} \right)^{1/2} \exp\left[ -\frac{n}{2} \sum_{i=1}^{2} \sum_{j=1}^{2} \left( \frac{\delta_{ij}}{p_i} + \frac{1}{p_3} \right) (\hat{p}_i - p_i)(\hat{p}_j - p_j) \right]. \tag{5} \]


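The elements of $\Sigma$ are

\[ \sigma_{ij} = \delta_{ij} p_i - p_i p_j. \tag{6} \]

From (5) we obtain the large sample variances of the estimators as

\[ \sigma^2(\hat{p}_i) = \frac{1}{n} \left[ p_i (1 - p_i) \right] \tag{7} \]

and covariances

\[ \sigma(\hat{p}_i, \hat{p}_j) = -\frac{1}{n} (p_i p_j). \tag{8} \]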
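
The large-sample variances and covariances can be evaluated directly. The sketch below is a minimal illustration, assuming hypothetical probabilities (.2, .3, .5) and a hypothetical sample size of 100; neither comes from the paper, and the function name is ours. It returns the standard errors, the covariance matrix, and the implied correlations among the estimators.

```python
import math


def estimator_moments(p, n):
    """Large-sample standard errors, covariances, and correlations.

    p: category probabilities summing to 1
    n: sample size
    """
    k = len(p)
    # Standard errors: square roots of the variances p_i (1 - p_i) / n.
    se = [math.sqrt(pi * (1 - pi) / n) for pi in p]
    # Covariance matrix: variances on the diagonal, -p_i p_j / n elsewhere.
    cov = [[(p[i] * (1 - p[i]) if i == j else -p[i] * p[j]) / n
            for j in range(k)] for i in range(k)]
    # Correlations follow by scaling each covariance by the two standard errors.
    corr = [[cov[i][j] / (se[i] * se[j]) for j in range(k)] for i in range(k)]
    return se, cov, corr


# Hypothetical inputs chosen only for illustration.
se, cov, corr = estimator_moments([0.2, 0.3, 0.5], n=100)
print(round(se[0], 3), round(se[1], 3), round(corr[0][1], 3))
```

The off-diagonal correlations are always negative, since an excess of subjects in one category must be offset by a deficit elsewhere.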


One final problem concerning the estimators is to obtain the
asymptotically smallest confidence region for the parameters.


Confidence Region for the Parameters 


Again we make use of the theorem of a multivariate normal distribution
of the estimators for large samples, and note that the
quadratic form of a k-variate normal distribution has the chi-square
distribution with k degrees of freedom (Wilks, 1962).

We have already determined the coefficients of the quadratic
form as given by (3). For the particular case, the quadratic form

\[ U = n \sum_{i=1}^{2} \sum_{j=1}^{2} \sigma^{ij} (\hat{p}_i - p_i)(\hat{p}_j - p_j) \tag{9} \]

is distributed as chi square with two degrees of freedom, with
$P(U < \chi_\gamma^2) = 1 - \gamma$ where $\chi_\gamma^2$ is the $100\gamma\%$ point of the chi square
distribution. Replacing $\sigma^{ij}$ by (4) and summing (9) with respect
to $j$, noting that $p_3 = 1 - p_1 - p_2$, we find that

\[ U = \sum_{i=1}^{3} \frac{(n_i - n p_i)^2}{n p_i} \tag{10} \]

and

\[ P\left[ \sum_{i=1}^{3} \frac{(n_i - n p_i)^2}{n p_i} < \chi_\gamma^2 \right] = 1 - \gamma. \tag{11} \]


Expression (11) can be used to establish a confidence region for the 
parameters, and because of its equivalence to the likelihood ratio 
test for large samples, can also be used in testing certain hypotheses 
about the estimates. Examples are provided later. 
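
A short sketch of how the criterion in (10) and (11) might be applied is given below. It is an illustration under stated assumptions rather than the authors' program: the function names are ours, and the critical value uses the closed form available for two degrees of freedom, where the upper tail probability of chi square equals exp(-x/2). The counts are those reported for item 28 in the example.

```python
import math


def pearson_u(counts, p):
    """Quadratic form U = sum over categories of (n_i - n p_i)^2 / (n p_i)."""
    n = sum(counts)
    return sum((c - n * pi) ** 2 / (n * pi) for c, pi in zip(counts, p))


def chi2_crit_2df(gamma):
    """Upper 100*gamma% point of chi square with 2 df (P(U >= x) = exp(-x/2))."""
    return -2.0 * math.log(gamma)


def in_confidence_region(counts, p, gamma=0.05):
    """True if the parameter point p lies inside the 100(1 - gamma)% region."""
    return pearson_u(counts, p) < chi2_crit_2df(gamma)


counts = [21, 114, 163]  # n_A1, n_A2, n_A3 reported for item 28
print(in_confidence_region(counts, [0.07, 0.38, 0.55]))   # point near the estimates
print(in_confidence_region(counts, [1 / 3, 1 / 3, 1 / 3]))  # equally likely outcomes
```

Scanning a grid of (p_1, p_2) values with `in_confidence_region` traces out the elliptical region described in the text.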

Having established the maximum likelihood estimators of the $p_i$,
their sampling distribution, and a confidence region, our only remaining
concern is for determining the $n_i$ appearing in the estimators.
Since these values are not directly observable, procedures
are needed to obtain them. 


Procedure for Determining the Number in Each Category 


If n is the number of individuals responding to any item, then

\[ n = n_{A_1} + n_{A_2} + n_{A_3}, \tag{12} \]

where $A_1$, $A_2$, and $A_3$ are as defined previously.

Since, for any item on the test there is a certain probability of 
guessing that item correctly when the correct answer is not 
“known,” let 


p = probability of guessing correctly. 


On the single administration of a test one can determine only the 
number of individuals who answer correctly and the number who 
answer incorrectly. In this discussion we will show how testing both 
before and after exposure to information allows for the determina- 
tion of each of the unknowns in (12). In the experimental situation 
suggested the initial test situation (pre-exposure) occurs without 
exposing the subjects to the information, i.e., the initial stimulus 
input, on which they are tested. After an appropriate time interval, 
the subjects are retested on the complete reading or listening comprehension
task, i.e., tested on both the initial and subsequent
stimulus inputs (post-exposure). These two testings permit the following
values to be determined.

Let RR equal the number of individuals who answer a given item
correctly in both the pre-exposure and post-exposure situations.
This RR term will be composed of: (1) all the individuals who
know or consistently select the correct response, (2) the proportion
who learned it and also guessed it right initially, and (3) the proportion
who failed to learn it but guessed it right both times. That
is,

\[ RR = n_{A_1} + p\, n_{A_2} + p^2 n_{A_3}. \tag{13} \]

Now let WR equal the number who missed the item initially but
who answered it correctly in the post-exposure situation. This term
will consist of: (1) the proportion of those who learned the correct
response but guessed wrong initially, and (2) the proportion of
those who didn’t learn it and guessed right in the post-exposure situation
but guessed wrong initially. That is,

\[ WR = (1 - p)\, n_{A_2} + p(1 - p)\, n_{A_3}. \tag{14} \]


Let RW be the number who answered correctly initially but who
answered incorrectly in the post-exposure situation. This term will
consist solely of those who do not know the correct answer after
exposure since they will be the only group which answers incorrectly
the second time. This term will equal the probability of this group
answering correctly once and incorrectly once times $n_{A_3}$. That is,

\[ RW = p(1 - p)\, n_{A_3}. \tag{15} \]

The final term, WW, is composed of those people who answer incorrectly
twice and is again composed of people who do not know
the item after exposure. WW is equal to the probability of guessing
wrong twice times the number who do not know or learn. That is,

\[ WW = (1 - p)^2\, n_{A_3}. \tag{16} \]
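
The four response patterns can be generated from the category sizes and the guessing probability. The sketch below assumes, as the verbal description implies, that a subject in $A_3$ guesses independently on the two testings, so that guessing right twice has probability p squared. The function name and the numerical inputs are hypothetical.

```python
def expected_counts(n_a1, n_a2, n_a3, p):
    """Expected RR, WR, RW, WW counts for category sizes and guessing rate p."""
    rr = n_a1 + p * n_a2 + p ** 2 * n_a3        # right on both testings
    wr = (1 - p) * n_a2 + p * (1 - p) * n_a3    # wrong, then right
    rw = p * (1 - p) * n_a3                     # right, then wrong
    ww = (1 - p) ** 2 * n_a3                    # wrong on both testings
    return rr, wr, rw, ww


# Hypothetical category sizes and guessing probability, for illustration only.
rr, wr, rw, ww = expected_counts(100, 50, 50, 0.25)
print(rr, wr, rw, ww)
print(rr + wr + rw + ww)   # the four patterns account for all 200 subjects
print(rw / (rw + ww))      # recovers the guessing probability
```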


Equations (15) and (16) yield the following solution for p.

\[ p = \frac{RW}{WW + RW} \tag{17} \]

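Treating p as a coefficient we combine (13), (14), (15), and
(16) into one system of equations

\[
\begin{aligned}
n_{A_1} + p\, n_{A_2} + p^2 n_{A_3} &= RR \\
(1 - p)\, n_{A_2} + p(1 - p)\, n_{A_3} &= WR \\
p(1 - p)\, n_{A_3} &= RW \\
(1 - p)^2 n_{A_3} &= WW
\end{aligned}
\tag{18}
\]

It is readily determined that the coefficient matrix, i.e.,

\[
\begin{pmatrix}
1 & p & p^2 \\
0 & (1 - p) & p(1 - p) \\
0 & 0 & p(1 - p) \\
0 & 0 & (1 - p)^2
\end{pmatrix}
\]

and the augmented matrix, i.e.,

\[
\begin{pmatrix}
1 & p & p^2 & RR \\
0 & (1 - p) & p(1 - p) & WR \\
0 & 0 & p(1 - p) & RW \\
0 & 0 & (1 - p)^2 & WW
\end{pmatrix}
\]
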

are both of rank 3 so that the system is consistent. Furthermore, a
unique solution exists for the unknowns, i.e., $n_{A_1}$, $n_{A_2}$, and $n_{A_3}$.
Solving for $n_{A_3}$ we have,

\[ n_{A_3} = \frac{(RW + WW)^2}{WW}. \tag{19} \]

Further algebraic manipulation yields

\[ n_{A_2} = \frac{(WR - RW)(RW + WW)}{WW} \tag{20} \]

and

\[ n_{A_1} = RR - \frac{RW \cdot WR}{WW}. \tag{21} \]

$n_{A_1}$, $n_{A_2}$, and $n_{A_3}$ are the values to be used in the maximum likelihood
estimation of $p_{A_i}$ as given in (2).
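
The solution of the system can be sketched in a few lines of Python. This is an illustrative implementation with a hypothetical function name, using the closed-form expressions obtained by eliminating p from the four equations; the input is the set of raw counts reported for item 1 in the example.

```python
def solve_categories(rr, rw, wr, ww):
    """Recover p, n_A1, n_A2, n_A3 from the four observed response counts."""
    if ww == 0:
        raise ValueError("WW must be positive for the solution to exist")
    p = rw / (rw + ww)                     # guessing probability
    n_a3 = (rw + ww) ** 2 / ww             # neither knew nor learned
    n_a2 = (wr - rw) * (rw + ww) / ww      # learned from exposure
    n_a1 = rr - rw * wr / ww               # knew before exposure
    return p, n_a1, n_a2, n_a3


# Raw counts reported for item 1: RR = 224, RW = 4, WR = 65, WW = 5.
p, n_a1, n_a2, n_a3 = solve_categories(224, 4, 65, 5)
print(round(n_a1), round(n_a2), round(n_a3))  # 172 110 16
```

A negative value of `n_a2` arises whenever RW exceeds WR, that is, when more subjects move from right to wrong than from wrong to right.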

Several assumptions underlying the use of this model should be 
pointed out. First, from (18) it is apparent that a subject's responses
to the same item under the pre- and post-exposure conditions are
assumed independent when he does not know and fails to learn
the correct choice. Our initial empirical investigation of the model
leads us to believe that violation of this assumption may not seriously 
affect the results. Secondly, it has been assumed that exposure has 
only one effect upon the subjects’ responses, that effect being to 
increase the probability of selecting the correct response. Subjects 
who know the correct response prior to exposure are not expected 
to change their responses under the post-exposure condition. It is, 
however, possible that certain properties of the information provided, 
i.e., the initial stimulus input, serve to reduce the probability of
the correct response from pre- to post-exposure conditions. Violation
of this assumption will lead to such effects as negative $n_{A_2}$ values,
which are clearly inconsistent, and which reflect inconsistent and
undesirable properties of the items. 


An Example 


In the summer term of 1965, 298 freshmen entering Penn State 
were given a 30-item experimental reading comprehension test. The 
items had been selected from a pool of 60 items used in the develop- 
ment of a reading comprehension test which is presently part of 
the Preregistration test battery at Penn State. It was necessary 
to eliminate some test items from this original pool because they 
had little meaning out of context, i.e., without the accompanying
paragraph. In addition, 15 items used in the experimental test had 
been responded to by the subjects during their preregistration testing 
program. It was possible that this initial exposure, which incidentally 
was not part of the empirical test of the model, would tend to increase 
the $n_{A_1}$ for these items. Inspection of the $n_{A_1}$'s for the 15 experimental
items used in the preregistration reading comprehension test and 
the 15 experimental items not used in that test indicated these values 
to be systematically higher for the preregistration items. It still 
cannot be determined, however, whether this was due to inherent 
properties of the items or to carry-over effects (memory) which 
would be a function of the subjects. 

Table 1 presents the raw and computed data for the 30 experimental
reading comprehension items. Items indicated by an asterisk were
part of the preregistration comprehension test. The total n for some
items did not sum to 298 because of the failure of some individuals
to respond to those items twice. The column headings RR, RW, WR,
and WW were explained earlier, while the values in columns $n_{A_1}$, $n_{A_2}$,
and $n_{A_3}$ were computed using equations (19), (20), and (21). Dividing
each of these values by n yields $p_{A_1}$, $p_{A_2}$, and $p_{A_3}$.


TABLE 1
Data Compiled on 30 Experimental Reading Comprehension Items

Item
Number    RR   RW   WR   WW  Total  n_A1  n_A2  n_A3   p_A1   p_A2   p_A3

  1*     224    4   65    5   298    172   110    16    .58    .37    .05
  2*     142   22   90   44   298     96   102    99    .32    .34    .34
  3      105   12  138   43   298     66   161    71    .22    .54    .24
  4       37   43   17  199   296     33   −32   294    .11   −.11   1.00
  5*     156    4   92   46   298    148    96    54    .50    .32    .18
  6*     214    1   56   25   296    212    57    27    .72    .19    .09
  7*     195    8   45   50   298    188    43    67    .63    .14    .23
  8*     180    9   35   73   297    176    29    92    .59    .10    .31
  9       48   52   71  125   296     18    27   251    .06    .09    .85
 10       60   37   46  154   297     49    11   237    .17    .04    .79
 11*     113   25   37  122   297    105    14   177    .35    .05    .60
 12      113   43   37  104   297     98    −8   208    .33   −.03    .70
 13*      55    7   96  138   296     50    94   152    .17    .32    .51
 14      131   15  116   36   298     83   143    72    .28    .48    .24
 15*      77   39   75  106   297     49    49   198    .17    .17    .66
 16*       9    8   18  262   297      8    10   278    .03    .03    .94
 17      141   24   23  109   297    136    −1   162    .46    .00    .54
 18*      10   14   13  261   298      9    −1   290    .03    .00    .97
 19      174   29   25   69   297    163    −6   139    .55   −.02    .47
 20*     183   18   50   46   297    163    45    89    .55    .15    .30
 21*     174   13   47   63   297    164    41    92    .55    .14    .31
 22       41   17   25  211   294     39     9   246    .13    .03    .84
 23       78   23   52  145   298     70    34   195    .24    .11    .65
 24       28   64   11  192   295     24   −71   341    .08   −.24   1.16
 25*     201    7   65   25   298    183    74    41    .61    .25    .14
 26*     219   18   35   26   298    195    29    74    .65    .10    .25
 27      137    9  133   19   298     74   183    41    .25    .61    .14
 28       81   37  111   69   298     21   114   163    .07    .38    .55
 29       38    7  207   46   298      7   230    61    .02    .77    .21
 30       36    7  176   79   298     20   184    94    .07    .62    .31

Several items require discussion in that negative values were
obtained, although this is not to be expected. These are items 4, 12,
19, and 24. Careful examination of items 4, 12, and 19 indicated
that the passage contained little or no information which aided in
answering the question. As such, there was no reason to expect
an increase in the number of subjects learning the correct response.
In fact, the negative $n_{A_2}$ implies that reading the passage had in
some way led the subject away from, rather than towards, the correct
answer. The negative $n_{A_2}$ for item 24 is predictable from the model
in that upon further examination of the item it was found that
it had been accidentally misscored. Reading of the passage did
lead to answering the item correctly, but an incorrect alternative
had been keyed correct.

This analysis would lead us to discard items 4, 12, and 19 because
of ambiguity and rescore item 24. In the following section some
criteria are offered for evaluating reading or listening comprehension
items which are not so clearly inappropriate.

Before discussing these criteria we will select one item and de- 
velop the standard deviations of the estimators and their correla- 
tion, the 95 per cent confidence region for the parameters, and pre- 
sent an example of a test on a hypothesis concerning the 
parameters. 


Taking item 28 we have $p_{A_1} = .07$, $p_{A_2} = .38$, and $p_{A_3} = .55$.
Using equations (7) and (8) for the parameters, say $p_{A_1}$ and $p_{A_2}$,
we find $\sigma(\hat{p}_{A_1}) = .014$, $\sigma(\hat{p}_{A_2}) = .028$, and $\rho(\hat{p}_{A_1}, \hat{p}_{A_2}) = -.26$.
The 95 per cent confidence region for the true parameter point
$(p_{A_1}, p_{A_2})$ is plotted in Figure 1. Thus, the probability is about .95
that, before the sample was drawn, the region we constructed would
cover the true parameter point. The values of the $p_i$'s which satisfy
the inequality in (11) are the points within the ellipse drawn.

Figure 1. Confidence region for the true parameter point.

As an example of testing hypotheses on the parameters, let us
set up the null hypothesis that the outcomes are equally likely,
i.e., $H_0$: $p_i = .33$. The likelihood ratio $\lambda$ for this hypothesis is given by

\[ \lambda = \frac{(1/3)^{n}}{\hat{p}_1^{\,n_1}\, \hat{p}_2^{\,n_2}\, \hat{p}_3^{\,n_3}}. \]

The computation of this test is tedious, but since the limiting distribution
of $-2 \log \lambda$ (as $n \to \infty$) is the chi square distribution (if
$H_0$ is true) we may use (10) as a test criterion. For the present example,
the chi square value with 2 degrees of freedom is highly significant,
at the .01 level, leading us to reject the hypothesis of
equally likely outcomes.
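
With a computer the likelihood ratio itself is no longer tedious to evaluate. The sketch below is an illustration with a hypothetical function name: it computes minus twice the log likelihood ratio for the item 28 counts under the hypothesis of equally likely outcomes and compares it with the .01 critical value for two degrees of freedom, obtained from the closed form -2 log(.01).

```python
import math


def neg2_log_lambda(counts, p0):
    """-2 log(lambda) for multinomial counts against hypothesized p0."""
    n = sum(counts)
    # Each term is n_i * log(observed / expected); empty cells contribute zero.
    return 2.0 * sum(c * math.log(c / (n * p))
                     for c, p in zip(counts, p0) if c > 0)


counts = [21, 114, 163]                # n_A1, n_A2, n_A3 for item 28
stat = neg2_log_lambda(counts, [1 / 3, 1 / 3, 1 / 3])
critical_01 = -2.0 * math.log(0.01)    # chi square, 2 df, .01 level (about 9.21)
print(stat > critical_01)              # True: reject equally likely outcomes
```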


Suggested Criteria for Item Evaluation 


Our intuitive notion of the comprehension task leads us to conclude
that tests where scores are unduly influenced by specific
previous knowledge or response biases are invalid measures of this
ability.

Items comprising such tests are better conceptualized as achievement
type items rather than as the analytical ability items we are
interested in for the evaluation of reading or listening comprehension
(Guttman, 1965). The first criterion for the evaluation of reading
and listening comprehension items should be, then, that there are
few people who can respond correctly to an item prior to the exposure
to the selected information. We state this criterion as $p_{A_1} \to 0$.
In terms of the present example, the items which remain after
eliminating items 4, 12, 19, and 24 have been arranged in Table 2
according to the degree to which this condition is met. Clearly, many
of these items are unacceptable because of their high $p_{A_1}$ values.
These items are probably best treated as measuring something other
than what is commonly called comprehension.


The ratio of $p_{A_2}$ to $p_{A_2} + p_{A_3}$,

\[ \frac{p_{A_2}}{p_{A_2} + p_{A_3}}, \]

serves as an index of the
degree to which subjects are selecting the correct response as a
function of exposure to the information, and reflects what may be
called the difficulty of the item. The use of this criterion depends
upon the purposes of the test and theoretical assumptions underlying
its development. To maximize the variability of test scores one
would want this index to be close to .50 (provided $p_{A_1}$ is sufficiently
close to 0); in other situations, however, it may be desirable to
choose items with varying degrees of difficulty. In the present example,
we might consider rejecting items 18, 22, 10, 17, and 19
since the index is quite small, indicating that few subjects are “learning”
the item.


TABLE 2
Difficulty Indexes of Experimental Items

Item
Number    p_A1    p_A2    p_A3    p_A2/(p_A2 + p_A3)

 29       .02     .77     .21     .79
 16       .03     .03     .94     .03
 18       .03     .00     .97     .00
  9       .06     .09     .85     .10
 30       .07     .62     .31     .67
 28       .07     .38     .55     .41
 22       .13     .03     .84     .03
 15       .17     .17     .66     .20
 10       .17     .04     .79     .05
 13       .17     .32     .51     .39
  3       .22     .54     .24     .69
 23       .24     .11     .65     .14
 27       .25     .61     .14     .81
 14       .28     .48     .24     .67
  2       .32     .34     .34     .44
 11       .35     .05     .60     .08
 17       .46     .00     .54     .00
  5       .50     .32     .18     .64
 20       .55     .15     .30     .33
 21       .55     .14     .31     .31
  1       .58     .37     .05     .88
  8       .59     .10     .31     .24
 25       .61     .25     .14     .64
  7       .63     .14     .23     .38
 26       .65     .10     .25     .29
  6       .72     .19     .09     .68
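
The difficulty index described above is simple to compute from the estimated proportions. The following sketch, with a hypothetical function name, evaluates it for two of the items in the example (items 29 and 28).

```python
def learning_index(p_a2, p_a3):
    """Share of subjects not already knowing the answer who learn it."""
    denom = p_a2 + p_a3
    if denom == 0:
        raise ValueError("index undefined when p_A2 + p_A3 is zero")
    return p_a2 / denom


print(round(learning_index(0.77, 0.21), 2))  # item 29: 0.79
print(round(learning_index(0.38, 0.55), 2))  # item 28: 0.41
```

An item with an index near .50 and a small $p_{A_1}$ would, by the reasoning in the text, contribute most to score variability.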

Use of these two criteria, depending upon the test constructor’s
needs, provides a rapid and concise evaluation of individual paragraphs
and their related items.


Some Other Uses of the Model

Reading and listening comprehension tests are only one example
of the measurement of knowledge gained during a brief and rather
pure exposure to selected information. The model can be extended
to analyses of longer term learning situations, such as academic
courses or other training programs.

Briefly, an analysis of such a situation, say an academic course,
could be effected by giving an examination designed to cover all
of the content areas of the course on the first day of class. At the
conclusion of the course the test could then be readministered.
Information the instructor could obtain from analyzing these data
in terms of $p_{A_1}$, $p_{A_2}$, and $p_{A_3}$ would include the identification of areas
in the course which most students know when they begin the course,
the identification of areas which are causing difficulty to the students,
and also areas where most learning is taking place.


VERBAL INTERESTS AND INTELLIGENCE:
COMPARISON OF STRONG VIB, TERMAN CMT, AND
D-48 SCORES OF GIFTED ADOLESCENTS.


GEORGE S. WELSH²
The University of North Carolina at Chapel Hill 


A group intelligence test, the Terman Concept Mastery Test 
(CMT) (Terman, 1956) was given to adolescents attending an 
eight-week residential summer program especially designed for 
gifted high school students in Academic and in Arts areas (Carnegie 
Corporation Quarterly, 1964). It was apparent from the distribu- 
tion of scores that the students in Academic areas had scored higher 
on this test than those in the Arts areas. 

One explanation for this difference was suggested by the essen- 
tially verbal nature of the Terman CMT which consists of 115 
word pairs that must be classified as either synonyms or antonyms, 
and 75 analogies that are mostly in verbal form although there are 
a few numerical items. It is basically a high-level vocabulary test 
and in order to do well a subject must have a good knowledge of 
words and their meanings. Students in the Arts areas were not se- 
lected on the basis of test scores and high school grades as those 
in Academic areas were, but rather because of demonstrated talent 
and skill in painting, drama, dance, or music. The hypothesis was 
tested that a nonverbal intelligence test would not show the marked 
difference that had been shown by the Terman. 

The nonverbal test employed was the D-48 (Black, 1963). This
test utilizes stimulus material in the form of dominoes and requires
no verbal ability whatsoever (except for the instructions) and in-
volves only the ability to count dots and to see patterned relation-
ships. It was found, however, (Welsh, 1966) that the nonverbal test
showed exactly the same kind of differentiation as the verbal test
had, and that the magnitude of the difference was at the same level.
That is, students in the Academic areas also scored relatively
higher on the D-48 than those in the Arts areas.

1 This paper was presented at the Southeastern Psychological Association
meetings, New Orleans, Louisiana, on March 31, 1966.
2 The present writer served as research consultant to the Governor's School

Although the Terman and the D-48 were significantly correlated 
in this group of subjects, the magnitude of the relation (r = .50) 
makes it apparent that there would be differences between the sets 
of scores for many of the students. A possible explanation for such 
differences was sought in verbal interests; that is, it was assumed 
that students who did relatively better on the verbal test might 
have higher verbal interests than those who did better on the non- 
verbal tests. 

This hypothesis was tested using the Strong Vocational Interest
Blank (SVIB) (Strong, 1959); the men’s form had been given to
the girls as well as the boys at the school and had been scored for
all of the regular scales. The scales at issue are those in Group X
(Advertising Man, Lawyer, and Author-Journalist), which Darley and
Hagenah (1955) refer to as “verbal-linguistic.”

Three groups of subjects were selected from the 773 students who 
attended the school during two summers of operation and had com- 
pleted both the Terman and the D-48 as well as the Strong. One 
group did relatively better on the Terman, one relatively better on 
the D-48, while the third did equally well on both. The raw scores 
on both tests were converted by formula to standard scores (T- 
scores) and a difference score obtained by subtracting the D-48 
from the Terman. Subjects were then selected from the two ends 
and the middle of the distribution of difference scores; equal num-
bers of each sex and of students from each summer were selected so
that there were 60 subjects in each of the three groups.

For the high Terman group the mean T-scores were 62 on the 
Terman and 42 on the D-48, for the high D-48 group 43 and 59, 
while the middle group obtained 51 and 50. Thus, in terms of the 
summed T-scores for the two tests (104, 102, and 101) the three 
groups were equated for overall intelligence. Statistics in raw score 
form as well as in T-scores are shown in Table 1. 
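
The conversion and selection procedure described above can be sketched as follows. This is only an illustration under stated assumptions: the article does not print its conversion formula, so the usual T-score definition (mean 50, standard deviation 10) with a population-style standard deviation is assumed, and the raw scores below are hypothetical, not data from the study.

```python
from statistics import mean, pstdev

def t_scores(raw):
    """Convert raw scores to T-scores (mean 50, SD 10) within the given group."""
    m, s = mean(raw), pstdev(raw)
    return [50 + 10 * (x - m) / s for x in raw]

# Hypothetical raw scores for six students (not data from the study).
cmt_raw = [95, 40, 62, 88, 35, 58]
d48_raw = [24, 36, 31, 26, 35, 30]

cmt_t = t_scores(cmt_raw)
d48_t = t_scores(d48_raw)

# Difference score: Terman T-score minus D-48 T-score.
diff = [c - d for c, d in zip(cmt_t, d48_t)]

# Rank students by difference score, then take both ends and the middle.
order = sorted(range(len(diff)), key=lambda i: diff[i])
high_d48 = order[:2]       # most negative differences
middle = order[2:4]        # differences near zero
high_terman = order[-2:]   # most positive differences

print(high_terman, middle, high_d48)
```

In the study itself the same idea is applied to 773 students, with 60 subjects drawn for each group and sex and summer balanced within groups; that balancing step is not reproduced here.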
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TABLE 1 


Summary Statistics of Intelligence Test Scores
for Three Groups of Adolescents

                           Terman CMT                    D-48
                       Male        Female          Male        Female

High Terman Groupᵃ
  M                 90.5 (62)    90.3 (62)      25.3 (42)    23.6 (41)
  SD                33.3 (11)    37.9 (13)       7.1 (11)     8.0 (12)
High D-48 Group
  M                 34.6 (43)    36.6 (43)      34.8 (58)    36.7 (61)
  SD                21.0 (7)     17.9 (6)        4.4 (7)      4.0 (0)
Medium Group
  M                 61.5 (52)    55.1 (50)      31.3 (52)    29.7 (50)
  SD                23.9 (8)     19.9 (7)        5.0 (8)      4.4 (7)

a For each group N = 60, 30 males and 30 females.
b Raw score means and standard deviations are followed by T-scores (in parentheses) derived
from entire population of students.


Results for the Strong scales are shown in Table 2 where it is clear 
that the group high on the verbal test of intelligence averaged 
higher on all three scales than the group high on the nonverbal test, 
while the group doing equally well on both intelligence tests fell in 
between. All differences reached a statistically stable level of sig-
nificance as demonstrated by analysis of variance with F ratios as
follows: Advertising Man, 4.63 (p between .05 and .01); Lawyer,
11.11; and Author-Journalist, 11.56 (p’s beyond .01).


TABLE 2 


Summary Statistics of Strong Verbal-Linguistic Scales for
Adolescents Selected on Intelligence Tests

                        Adv. Man     Lawyer     Auth.-Journalist

High Terman Groupᵃ
  M                       43.2        43.5           43.8
  SD                      10.9         8.7            9.8
High D-48 Group
  M                       37.5        35.9           36.1
  SD                       9.1         8.1            7.8
Medium Group
  M                       40.4        38.8           39.3
  SD                      10.7         9.5            8.9

a For each group N = 60, 30 males and 30 females.
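
The reported F ratios can be roughly checked from the summary statistics in Table 2 alone. The sketch below assumes a one-way analysis of variance with three equal groups of 60 and treats the tabled SDs as sample standard deviations (n - 1 denominator); neither detail is stated in the article, and the function name is hypothetical. Because the tabled means and SDs are rounded, the results should only approximate the printed values.

```python
def f_from_summary(means, sds, n):
    """One-way ANOVA F ratio from group means and SDs, equal group size n."""
    k = len(means)
    grand = sum(means) / k                      # grand mean (equal n)
    ms_between = n * sum((m - grand) ** 2 for m in means) / (k - 1)
    ms_within = sum(s ** 2 for s in sds) / k    # pooled variance (equal n)
    return ms_between / ms_within

n = 60  # subjects per group
# Order of groups: High Terman, High D-48, Medium (values from Table 2).
f_adv = f_from_summary([43.2, 37.5, 40.4], [10.9, 9.1, 10.7], n)
f_law = f_from_summary([43.5, 35.9, 38.8], [8.7, 8.1, 9.5], n)
f_aj = f_from_summary([43.8, 36.1, 39.3], [9.8, 7.8, 8.9], n)

# Degrees of freedom are (k - 1, k * (n - 1)) = (2, 177).
print(round(f_adv, 2), round(f_law, 2), round(f_aj, 2))
```

Under these assumptions the Advertising Man ratio agrees with the reported 4.63 to two decimals, while the Lawyer and Author-Journalist ratios fall near, but not exactly on, the reported 11.11 and 11.56, as would be expected from rounded inputs.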


Further support for the basic hypothesis was sought by a con-
verse approach. From the total population of students three groups
were selected on the basis of their Strong scores on the three verbal-
linguistic scales; the first group had A+ scores, the second A, and
the third C on all of these scales. The hypothesis tested was that the
difference scores (in the T-scores) obtained by subtracting the D-48
from the CMT would be positive for the A+ group and negative for
the C group. No attempt was made to equate the three groups for
size or for sex and they were unequal in both regards. The A+
group contained 12 boys and 29 girls, the A group 31 and 61, and the
C group 50 and 9.

The mean difference scores fell in the predicted direction:
A+ 7.87, A 2.58, and C -1.64. An analysis of variance showed an
F ratio of 9.21, significant beyond the .01 level.

Thus, a consistent relationship between verbal interests and in- 
telligence has been demonstrated in gifted adolescents: students 
scoring relatively higher on a verbal as compared with a nonverbal 
test of intelligence show higher interest scores on verbal-linguistic 
scales, and, conversely, students with higher verbal interests score
relatively higher on a verbal intelligence test. It will be necessary to 
replicate these findings with other age and educational groups and 
with other verbal and nonverbal tests of intelligence. Should the 
same kinds of differences appear, there will be important implica-
tions for measurement of intelligence as well as for vocational in-
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PATTERNS OF PSYCHOLINGUISTIC DEVELOPMENT 
DURING EARLY AND MIDDLE CHILDHOOD¹


MOHAMMED Y. QUERESHI 
Marquette University 


Over the past five decades a large number of studies have inves- 
tigated the basic regularities in the development of a number of 
psychological functions. Some of these studies have been specifi- 
cally concerned with the changes in the structure and organization 
of human abilities during various stages of life. Most of the early 
studies on changes in patterns of mental development, utilizing fac- 
tor analysis, were summarized by Garrett (1946) who advanced 
the “differentiation hypothesis” as a reasonable generalization of 
the then-existing evidence. Burt (1954) has recently reaffirmed his 
position regarding a similar hypothesis on age differentiation which 
he had originally proposed as early as 1919. The hypothesis indi- 
cates that general ability is fairly unified during the early years of 
life, but becomes fractionated into a “loosely organized group of 
abilities” as age increases. 

An examination of the more recent investigations in this area 
(see review by Anastasi, 1958; Weiner, 1964), however, does not 
provide any conclusive evidence for or against the differentiation 
hypothesis even if it is modified to apply only to the first 18 years 
of life. Apart from the methodological inadequacies that may at- 
tenuate the usefulness of a number of these studies, changes in the 
organization of cognitive or psycholinguistic development over cer-
tain age spans (2½ through 9, for example) have received compara-
tively little attention. The present study represents an attempt to
fill this void as well as to provide data that fulfill most of the meth-
odological requirements that any study of this type may be ex-
pected to meet. In addition, the data of this study are based on
tests that stem from a definite psychological theory of the develop-
ment of cognitive phenomena (Osgood, 1957), whereas previous
studies concerning this problem have generally employed work-
sample tests constructed on the basis of practical criteria rather
than any particular psychological model or theory.

1 Acknowledgement is due to Samuel A. Kirk of the University of Illinois
and James J. McCarthy of the University of Wisconsin for the raw
data for this study.


Method 


The data for this study are based on 700 children tested in con- 
nection with the standardization of the Illinois Test of Psycholin- 
guistic Abilities (ITPA) devised by Kirk and McCarthy (1961, 
1963). The procedure used in selecting the above sample and for 
collecting the necessary data is briefly described below. 

In order to keep the hereditary-constitutional dimensions fairly 
uniform and also to control for linguistic and cultural differences 
that exist because of urban-rural and ethnic distinctions, the sam- 
ple was strictly limited to American-born, urban, white children. Ss 
of school age were randomly drawn from the total relevant popu- 
lation in the public schools of a midwestern community. About 80 
per cent of the preschool Ss were selected from the siblings of school 
cases while the remaining 20 per cent were drawn from those fam- 
ilies in the community whose first or oldest children were still at the 
preschool level. 

The total sample consisted of 14 age groups, beginning at 2-6 and 
continuing, by half-year intervals, to 9-0. Equal numbers of male 
and female Ss (25 males and 25 females) were tested at each of the 
14 age levels. In addition, the sample was restricted to children who 
fell within the IQ range of 80 to 120 on the 1937 Stanford-Binet In- 
telligence Scale, Form L. The ITPA was administered to each S at 
least one day after the administration of the Stanford-Binet. De- 
tailed information about specific sampling and testing procedures 
as well as about such indices as means, ranges, and standard devia- 
tions for each of the 14 age groups on such variables as CA, MA, 
IQ, and social class is available elsewhere (Quereshi, 1964). 


Test Materials 


The ITPA is composed of nine subtests and has been devised for
the purpose of differential assessment of children’s language devel-
opment. Conceptually the test is based on Osgood's (1957) model
founded on a “behavioristic analysis of perception and language as
cognitive phenomena.” Detailed description of the rationale behind
the various subtests and the procedures for construction and selec-
tion of items, etc. are reported elsewhere (Kirk and McCarthy,
1961; McCarthy and Kirk, 1963). The nine subtests named in ac-
cordance with the psycholinguistic functions measured by each are:
Auditory Vocal Automatic (AVAu), Motor Encoding (ME), Au-
ditory Vocal Association (AVA), Visual Motor Sequencing (VMS),
Auditory Vocal Sequencing (AVS), Visual Decoding (VD), Visual
Motor Association (VMA), Auditory Decoding (AD), and Vocal
Encoding (VE).


Treatment of Data 


The data of the present study differ in two respects from those 
used to obtain certain statistical indices for the ITPA (McCarthy 
and Kirk, 1963): (a) the subtests as used in the present study in- 
clude all the items that comprised each subtest originally and (b) 
the scoring cutoffs for eight of the nine subtests, unlike those ap- 
plied by the test authors, have been adopted in accordance with the 
indications of a recent study (Quereshi, 1964) concerning the rules 
for selecting an appropriate discontinuance procedure. A compari-
son of the respective number of items and scoring cutoffs utilized in
the present study with those employed by McCarthy and Kirk
(1963) is presented in Table A.²


Reorganization of Data 


Out of the 14 age groups of 50 Ss each, seven age groups of 100
each were formed by combining into one group ages 2½ and 3, 3½
and 4, and so on through 8½ and 9 years respectively. This step
was dictated by the need to obtain fairly dependable results in view
of the nature of the statistical analysis selected for this study. Raw
scores were available on nine ITPA subtests together with the Stan-
ford-Binet MAs. In accordance with the procedure and rationale
described below, four additional variables were introduced. The
Pearson rs were then computed for these 14 variables for each of
the seven age groups separately as well as for the entire group of
700.

2 A 9-page document containing Tables A through I has been deposited
with the ADI Auxiliary Publications Project. Order Document No. 9322, re-
mitting $1.25 for 35-mm. microfilm or $1.25 for photocopies. Make checks
payable to: Chief, Photoduplication Service, Library of Congress.


Choice of an Appropriate Method of Factor Analysis and Subsidiary Considerations


Most studies on this topic in the past have employed general sum- 
mation methods (e.g., centroid, principal component, etc.) although 
such methods, without appropriate adjustments, are inappropriate 
for a problem of this type. Furthermore, most of the same studies 
have utilized certain rotational procedures for attaining either a 
particular level of “psychological meaningfulness” or for insuring a 
certain degree of blind objectivity despite the fact that such re-
course renders the aforementioned statistical procedures even more
unsuitable. Burt (1954, p. 86) was perhaps the first one to point out
the unsuitability of these methods and suggested an alternative 
procedure with an appropriate demonstration of its applicability. 
However, Burt’s procedure involves more labor than the factor 
analytic method employed here. Briefly, the present method is a 
modification of the square root method in which one general factor 
and three or more group factors are constituted beforehand on the 
basis of available empirical and/or theoretical knowledge about 
their nature and composition. A careful examination of the correla- 
tion matrices for the various age levels as well as of the theoretical 
substrata of the variables involved led to the postulation of three 
group factors in addition to the general factor. Table 1 presents the 
four pivotal variables (hypothesized factors), the procedure by 
which they were defined, and the rationale that determined their 
constitution. The correlation of the general factor, A, with each one 
of the other 13 variables was corrected for spuriousness that results 
from part-whole correlation. The 14 × 14 correlation matrices thus
constituted (see footnote 2, Tables B through E) were subjected to
a square root analysis (with unities in the diagonal) for each of the
seven age groups separately, pivoting on the reference variables A,
B, C, and D respectively. Similar analysis was carried out on the
correlation matrix for the combined group of 700 in order to obtain 
an overall picture of the relative prominence of variables with re- 
spect to each of the factors involved. 
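
The pivoting step of a square root (diagonal) analysis can be sketched as below. This is the textbook form of the method, not necessarily the exact modification used in the study, and the 4 × 4 correlation matrix is hypothetical. Each factor is the residual column of the chosen pivot variable divided by the square root of its residual variance; the residual matrix is then reduced before the next pivot is taken.

```python
import math

def square_root_factors(corr, pivots):
    """Extract one factor per pivot variable from a correlation matrix.

    Returns (loadings, residual); loadings[k][i] is the loading of
    variable i on the factor defined by pivots[k].
    """
    n = len(corr)
    residual = [row[:] for row in corr]
    loadings = []
    for j in pivots:
        pivot_var = residual[j][j]
        if pivot_var <= 1e-12:
            raise ValueError("pivot variable has no residual variance left")
        column = [residual[i][j] / math.sqrt(pivot_var) for i in range(n)]
        loadings.append(column)
        # Remove the variance and covariance explained by this factor.
        residual = [[residual[r][c] - column[r] * column[c] for c in range(n)]
                    for r in range(n)]
    return loadings, residual

# Hypothetical correlation matrix with unities in the diagonal.
corr = [
    [1.0, 0.6, 0.5, 0.4],
    [0.6, 1.0, 0.3, 0.2],
    [0.5, 0.3, 1.0, 0.1],
    [0.4, 0.2, 0.1, 1.0],
]
loadings, residual = square_root_factors(corr, pivots=[0, 1])

# Per cent of total variance for each factor: mean squared loading times 100.
pct_variance = [100 * sum(a * a for a in col) / len(corr) for col in loadings]
print(loadings, pct_variance)
```

Two properties of the output match the layout of the loading tables in this study: a pivot variable loads 1.000 on its own factor when it is taken first, and it has zero loadings on every factor extracted after it.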
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TABLE 1 


Pivotal Variables in Factor Analysis Based on 14 × 14 Correlation Matrices
for the Various Age Groups

Label  Procedure for Forming              Rationale

A      Sum of scores on all 9 subtests    A general ability factor should have
       plus Stanford-Binet MA             loadings on all variables in a study,
                                          especially on a global measure such
                                          as Stanford-Binet MA.

B      Sum of scores on AVAu, AVA,        A group factor should consist of
       and AVS subtests                   variables that have consistently high
                                          intercorrelation at most age levels
                                          and have, according to Osgood’s
                                          theory (1957), two “channels” in
                                          common.

C      Sum of scores on VMS, VMA,         The same rationale as for B; however,
       and AD subtests                    theoretically, AD subtest does not
                                          belong here or in any other group.
                                          VMS and VMA have two “channels”
                                          in common.

D      Sum of scores on ME, VE,           The same rationale as for B; although
       and VD subtests                    VD, theoretically, does not belong
                                          here or in any other group. ME and
                                          VE have one “process” in common.

Note. It is essential that a group factor be composed of at least three variables. On a theoretical
basis alone, subtests that have either two “channels” or one “process” in common belong in one
group.
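
The construction of the four pivotal variables can be sketched as follows. The subtest groupings follow the table above; the scores are hypothetical, and the part-remainder correlation shown is only one common way of removing part-whole spuriousness, since the article does not state which correction formula was applied to the correlations with A.

```python
from math import sqrt
from statistics import mean

SUBTESTS = ["AVAu", "ME", "AVA", "VMS", "AVS", "VD", "VMA", "AD", "VE"]
GROUPS = {"B": ["AVAu", "AVA", "AVS"],
          "C": ["VMS", "VMA", "AD"],
          "D": ["ME", "VE", "VD"]}

def pearson(x, y):
    """Product-moment correlation between two equal-length score lists."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def pivotal_variables(scores):
    """Form A (all nine subtests plus MA) and the group sums B, C, and D."""
    n = len(scores["MA"])
    pivots = {"A": [sum(scores[k][i] for k in SUBTESTS + ["MA"])
                    for i in range(n)]}
    for label, members in GROUPS.items():
        pivots[label] = [sum(scores[k][i] for k in members) for i in range(n)]
    return pivots

def part_remainder_r(part, whole):
    """Correlate a part with the whole after removing the part from it."""
    remainder = [w - p for p, w in zip(part, whole)]
    return pearson(part, remainder)

# Hypothetical scores for eight children on the ten observed variables.
scores = {name: [3 * i + 2 * k + (i * k) % 5 for i in range(8)]
          for k, name in enumerate(SUBTESTS + ["MA"])}

pivots = pivotal_variables(scores)
r_whole = pearson(pivots["B"], pivots["A"])             # inflated by overlap
r_corrected = part_remainder_r(pivots["B"], pivots["A"])
print(round(r_whole, 3), round(r_corrected, 3))
```

The corrected value can never exceed the uncorrected one, which is the sense in which the uncorrected part-whole correlation is spuriously high.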


Results and Discussion


The differentiation hypothesis advanced by Garrett states that 
“Abstract or symbol intelligence changes in its organization as age 
increases from a fairly unified and general ability to a loosely or- 
ganized group of abilities or factors” (1946, p. 373). If this hypothe- 
sis is valid, then an appropriate analysis of mental and/or psycho- 
linguistic abilities across different age levels should yield three 
definite trends: (a) the percentage of variance accounted for by the 
general factor should gradually decrease from age to age, with the 
highest percentage at the youngest and the lowest at the oldest age 
level, (b) the percentage of variance contributed by each of the 
group factors involved should gradually increase with age, from the 
lowest for the youngest to the highest for the oldest age group, and 
(c) the average inter-correlation of the factors should decline as the
age advances, i.e., the factors should become more independent of 
each other with increasing age. 
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Changes in the General Factor 


Table 2 presents the loadings of the 14 variables on the general
factor, A, for the seven age groups ranging from 2½ to 9 years as
well as for the total group (combining all ages).³ The percentage of
variance accounted for by the general factor, from age to age and
for the total group, is also given in Table 2. It may be noted that
the per cent of variance, as predicted, is the highest (41.3) for the
youngest age group (2½-3 years) and lowest (22.8) for the oldest
age group (8½-9 years). The data very strongly corroborate part
(a) of the differentiation hypothesis since there is no reversal in
this trend. The correlation between the hypothesized ranks and the
obtained ranks for the percents of general factor variance at the
seven age levels, as estimated by the Spearman rho, is .991 (p <
.001). Although the differences among percentages of variance at
the intermediate age levels (4½-5, 5½-6, and 6½-7) are small, they
are in complete accord with the predicted trend. Moreover, if the
magnitude of reduction is to be taken into consideration, the sharp
decline between ages 2½-3 and 3½-4, 3½-4 and 4½-5, and
6½-7 and 7½-8 cannot be overlooked. The overall reduction from
the youngest to the oldest age group is about 45 per cent (from
41.3 to 22.8) if the per cent of variance for the youngest age level
is used as the base.

The loadings of the general factor on variables B, C, and D (con-
stituting the three group factors) generally follow a descending
trend from the youngest to the oldest age level, indicating that the
group factors become more independent of the general factor with
increasing age. Further discussion of this finding is presented later 
on in connection with the evidence concerning changes in the in- 
terdependence of factors from age to age. 


Changes in the Group Factors 


Since all the three group factors share a “common fate,” results
and discussion pertaining to these factors are presented simultane-
ously. Tables 3, 4, and 5 respectively present the loadings of differ-
ent variables, together with the percentage of variance, for various
age groups for the group factors B, C, and D. Examination of the
percentages of variance for all three factors indicates that generally
the group factors account for an increasing percentage of variance
with increasing age as called for by part (b) of the differentiation
hypothesis. The youngest age group invariably has the smallest per-
centage for all the three group factors, but the oldest group does
not necessarily possess the largest percentage of variance in the
case of all group factors.

3 For residuals see footnote 2, Tables F through I.


TABLE 2

General Factor (A) Loadings and Percentage of Variance
for the Various Age Groups

                                Age Groups (Years)

Variable      2½-3    3½-4    4½-5    5½-6    6½-7    7½-8    8½-9    Total (2½-9)

AVAu           594     544     415     454     488     459     451     890
ME             627     367     132     312     342     345     439     788
AVA            706     674     651     714     633     564     571     944
VMS            362     537     476     420     212     162     368     859
AVS            543     409     299     394     409     205     124     830
VD             539     541     304     217     329     193     337     831
VMA            507     549     293     355     334     525     298     862
AD             538     427     527     352     454     233     244     859
VE             641     592     411     464     431     457     365     824
MA             562     621     712     638     706     599     556     949
A            1.000   1.000   1.000   1.000   1.000   1.000   1.000   1.000
B              795     696     542     606     520     449     457     [?]
C              658     632     586     430     494     465     420     931
D              697     610     363     410     401     347     484     892

Per cent of
Variance      41.3    36.5    27.4    27.0    26.8    22.8    22.8    78.8

Note. N for each age is 100; N for the total group, 700. Decimals are omitted; all loadings are
reported up to three decimals.

In order to test the significance of the trends for all group fac- 
tors, Spearman’s rhos were computed between the hypothesized 
ranks and obtained ranks for the corresponding percentages of 
variance for the seven age groups on each group factor separately. 
The correlation coefficients are .821 (p < .02), .856 (p < .02), and
.607 (p = .08) respectively for the group factors B, C, and D. For 
the first two group factors the trend is highly significant, but for the 
third factor, although the trend is in the predicted direction, the 
level of significance does not reach the acceptable .05 level. On the 
whole, the prediction that group factors would assume greater 
prominence with increasing age seems to be fairly well corroborated. 
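
The rank correlations reported for the four factors can be rechecked from the percentages of variance printed in Tables 2 through 5. The sketch below assumes the classical Spearman formula, 1 - 6Σd²/[n(n² - 1)], with tied values given their mean rank and no further tie correction; the article does not say how the tie at 22.8 was handled. The hypothesized order is descending for the general factor and ascending for the group factors.

```python
def average_ranks(values):
    """Ranks from 1 (smallest) upward; tied values share their mean rank."""
    s = sorted(values)
    return [s.index(v) + (s.count(v) + 1) / 2 for v in values]

def spearman_rho(hypothesized, observed):
    """Classical Spearman rho from squared rank differences."""
    n = len(observed)
    d2 = sum((h - o) ** 2
             for h, o in zip(average_ranks(hypothesized),
                             average_ranks(observed)))
    return 1 - 6 * d2 / (n * (n * n - 1))

# Per cent of variance by age group, youngest to oldest (Tables 2 to 5).
pct = {
    "A": [41.3, 36.5, 27.4, 27.0, 26.8, 22.8, 22.8],
    "B": [9.3, 11.3, 13.6, 13.1, 15.1, 14.7, 13.7],
    "C": [9.7, 10.7, 12.2, 14.1, 12.8, 12.4, 14.4],
    "D": [9.3, 10.8, 14.0, 13.9, 14.9, 16.3, 12.9],
}
ascending = [1, 2, 3, 4, 5, 6, 7]
descending = ascending[::-1]

rho = {"A": spearman_rho(descending, pct["A"])}
for label in ("B", "C", "D"):
    rho[label] = spearman_rho(ascending, pct[label])

print({k: round(v, 3) for k, v in rho.items()})
```

Under these assumptions the coefficients agree with the reported .991, .821, .856, and .607 to within about .001.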


Interdependence of Factors 


In connection with the results of Table 2, reference was made to
the gradual decrease in the dependence of group factors on the
general factor. This finding provides only partial evidence concern-
ing the third deduction from the differentiation hypothesis: the fac-
tors should become less intercorrelated as age increases. Thus, the
change in the average intercorrelation of all four factors should
furnish fairly complete evidence for or against the third predic-
tion. A careful inspection of the appropriate segments of Tables B,
C, D, and E (see footnote 2) indicates that the factors do tend to
become less intercorrelated at later ages. The means of the correla-
tions among the four factors for the seven age groups are as follows:
.68 for 2½-3, .60 for 3½-4, .43 for 4½-5, .41 for 5½-6, .40 for 6½-7,
.34 for 7½-8, and .38 for age group 8½-9. The Spearman rho be-
tween the hypothesized and empirically obtained ranks for these
estimates yields a value of .964 (p < .001). Thus the significance of
this trend seems to be well established and the prediction is clearly
borne out.


TABLE 3

Loadings of the First Group Factor (B) and Its Percentage
of Variance for the Various Age Groups

                                Age Group (Years)

Variable      2½-3    3½-4    4½-5    5½-6    6½-7    7½-8    8½-9    Total (2½-9)

AVAu           417     521     572     601     658     544     409     280
ME             118     100     041     049    -065    -076     102     039
AVA            420     398     483     443     534     462     305     201
VMS            078     149     083     032    -012     109     095     069
AVS            569     650     688     650     691     780     841     419
VD             100     171     065     072    -070    -087    -096     041
VMA            127     028     048     088     081     091    -109     046
AD             203     154     149     189     169     111     166     119
VE             129     120     087     146     069     095    -052     045
MA             266     224     294     268     331     227     264     120
A              000     000     000     000     000     000     000     000
B              607     718     840     796     854     894     890     341
C              214     177     160     216     152     165     115     102
D              194     [?]     112     172     037     048     018     075

Per cent of
Variance       9.3    11.3    13.6    13.1    15.1    14.7    13.7     3.3

Note. N for each age is 100; N for the total group, 700. Decimals are omitted; all loadings are
reported up to three decimals.


TABLE 4

Loadings of the Second Group Factor (C) and Its Percentage
of Variance for the Various Age Groups

                                Age Group (Years)

Variable      2½-3    3½-4    4½-5    5½-6    6½-7    7½-8    8½-9    Total (2½-9)

AVAu           055    -044    -050    -056    -099     037     130    -023
ME             020     032    -029     016     108    -030     076     027
AVA            034     075     120     126     075     034     105     039
VMS            457     344     460     598     492     518     539     275
AVS           -060    -055    -109    -071    -044    -081    -118    -053
VD             113     105     144    -098     130     032     248     072
VMA            532     516     614     581     647     533     603     313
AD             546     698     585     681     567     612     633     346
VE             119     110     121     024     036     041     092     038
MA            -051     114     248     097     038     168     047     055
A              000     000     000     000     000     000     000     000
B              000     000     000     000     000     000     000     000
C              722     755     794     877     856     870     900     350
D              11?     114     116     004     162     073     194     068

Per cent of
Variance       9.7    10.7    12.2    14.1    12.8    12.4    14.4     3.1

Note. N for each age is 100; N for the total group, 700. Decimals are omitted; all loadings are
reported up to three decimals.


TABLE 5

Loadings of the Third Group Factor (D) and Its Percentage
of Variance for the Various Age Groups

                                Age Group (Years)

Variable      2½-3    3½-4    4½-5    5½-6    6½-7    7½-8    8½-9    Total (2½-9)

AVAu           012     009     138     025     061     130    -057     022
ME             611     754     780     745     689     772     613     473
AVA            040     093     006     070    -026     032     084     019
VMS            053     114     018     010     062    -044     058     018
AVS           -033    -110    -145    -106    -112    -139    -009    -057
VD             528     416     410     518     565     498     488     295
VMA            088     076    -093     018    -019     120     022     014
AD            -116     134     040    -059    -035    -073    -083    -033
VE             393     365     535     527     658     693     643     359
MA             034     026     000     124     162     174     189     045
A              000     000     000     000     000     000     000     000
B              000     000     000     000     000     000     000     000
C              000     000     000     000     000     000     000     000
D              679     760     918     896     901     934     853     441

Per cent of
Variance       9.3    10.8    14.0    13.9    14.9    16.3    12.9     4.6

Note. N for each age is 100; N for the total group, 700. Decimals are omitted; all loadings are
reported up to three decimals.

Before arriving at a final conclusion about the validity of the 
differentiation hypothesis, on the basis of the data of this study, one 
may examine the possibility that the continuous decrement from
age to age, in the per cent of the variance attributable to the general
factor, may be a sequel of the gradual decrease in the per cent of
total variance contributed by all of the four factors over the given
age range. An examination of the total per cents of variance (69.6
for 2½-3, 69.3 for 3½-4, 67.2 for 4½-5, 68.1 for 5½-6, 69.6 for
6½-7, 66.2 for 7½-8, and 63.8 for 8½-9) indicates that such a view
is questionable. The total variances extracted range between 63.8
and 69.6 per cent, indicating that they remain fairly uniform from
age to age. While the descending trend, as tested by the Spearman 
rho between the hypothesized and obtained ranks, is of borderline 
significance (rho = .687, p = .06), the differences are so small in 
magnitude as to be negligible for all practical purposes. 
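
The totals quoted above are the sums of the four factors' percentages of variance at each age level, and the borderline rho can be recomputed from them. As before, this sketch assumes the classical Spearman formula with mean ranks for ties (here the two totals of 69.6) and no further tie correction, which the article does not specify.

```python
def average_ranks(values):
    """Ranks from 1 (smallest) upward; tied values share their mean rank."""
    s = sorted(values)
    return [s.index(v) + (s.count(v) + 1) / 2 for v in values]

def spearman_rho(hypothesized, observed):
    """Classical Spearman rho from squared rank differences."""
    n = len(observed)
    d2 = sum((h - o) ** 2
             for h, o in zip(average_ranks(hypothesized),
                             average_ranks(observed)))
    return 1 - 6 * d2 / (n * (n * n - 1))

# Per cent of variance by age group, youngest to oldest (Tables 2 to 5).
general = [41.3, 36.5, 27.4, 27.0, 26.8, 22.8, 22.8]
group_b = [9.3, 11.3, 13.6, 13.1, 15.1, 14.7, 13.7]
group_c = [9.7, 10.7, 12.2, 14.1, 12.8, 12.4, 14.4]
group_d = [9.3, 10.8, 14.0, 13.9, 14.9, 16.3, 12.9]

# Total variance extracted by all four factors at each age level.
totals = [round(a + b + c + d, 1)
          for a, b, c, d in zip(general, group_b, group_c, group_d)]

# Hypothesized trend: totals decline with age.
rho_total = spearman_rho([7, 6, 5, 4, 3, 2, 1], totals)

print(totals, round(rho_total, 3))
```

The sums reproduce the seven totals given in the text, and the coefficient comes out within .001 of the reported .687.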

Another possible cause of the decrement in the influence of the 
general factor may be suspected to reside in the systematic decline 
in the variability of scores on the ITPA subtests from younger to 
the older groups. Evidence presented elsewhere (Quereshi, 1964, p. 
488, footnote 3) definitely negates even this possibility.


Evaluation of the Findings and Their Implications


It is clear that the evidence mustered by this study overwhelm-
ingly supports the differentiation hypothesis within the age range of
2½ to 9 years. A comparison of this study with others (e.g., Balin-
sky, 1941; Burt, 1954; Clark, 1944; Cohen, 1957; Doppelt, 1950;
Garrett, Bryan, and Perl, 1935; Weiner, 1964; to mention a few)
that have investigated the same problem, whether they confirm or
contradict the hypothesis, would suffer from several drawbacks,
the most obvious of which stem from the incomparability of age
range, type of test materials, methods of analysis, and Ss’ back-
ground. Hence no attempt is made here to account for the corre-
spondence or discrepancies between the results of this study and
those of others. However, it seems justifiable to say that, if the
methodological rationale of this study is defensible, the validity of
the differentiation hypothesis, at least during the age range studied
here, has been amply demonstrated. It is possible that during the
subsequent years, as indicated by some of the studies (Balinsky,
1941; McHugh and Owens, 1954) in this area, the differentiation
of abilities reaches a plateau 


the exact delineation of the с 


tions for both the test constru 


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Since the main concern here is with the principles of test construc- 
tion, there is little need to amplify the meaning of these findings 
for certain theories of cognitive or psycholinguistic development. 
For the test constructor, however, the following guidelines seem to 
be indicated: 

a. Tests intended for children up to about 10 years of age should 
be so constructed as to emphasize the general factor during the pre- 
school years and gradually give greater prominence to group fac- 
tors thereafter. However, the general factor should retain its pre- 
dominant role throughout most of the elementary school years. 
Most of the current group as well as individual ability tests seem to 
satisfy this requirement. 

b. Emphasis in constructing tests for the given ages should be 
on measuring group factors which are positively correlated with the 
general factor rather than on attempting to fit the test content into 
a model postulating several independent and more or less equally 
influential factors. The failure of various test batteries consisting 
of subtests designed to measure separate psychological functions 
during childhood years seems to lend further support to this impli- 
cation. 

c. Ability tests for the age levels studied here should be con- 
structed so that factors representing performance on these tests 
gradually become more independent of each other as well as the 
general factor, with the difference that, even at later ages, they re- 
tain some positive association with the general factor but not neces- 
sarily with each other. For a number of ability tests now in use, 
evidence on how they fare on this point is not currently available. 

d. Unless there is sufficient relevant and dependable evidence 
available demonstrating the worthwhileness of tests claiming to 
measure distinct cognitive characteristics, test users will generally 
be well advised to rely on instruments which yield global measures 
of ability whenever a test is selected for use with preschool or ele- 
mentary school children. The success in the past of certain work- 
sample tests can thus be given a factual explanation in the light of 
this conclusion. 


Summary 


The present study was designed to investigate the validity of the 
“differentiation hypothesis” during early and middle childhood in 
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the context of the abilities measured by the ITPA. The Ss were 700 
children divided into seven age groups of 100 each, ranging between 
2 years, 4 months and 9 years, 2 months, and equally divided be- 
tween the sexes throughout the age range. The samples were fairly 
homogeneous since the children were selected so that their CA and 
IQ fell within a specified, constant range at all age levels. 

Correlations among 14 variables (9 ITPA subtests, Stanford- 
Binet MA, and four pivotal variables) were computed separately 
for each of the seven age groups as well as for the total group com- 
prising all ages. Separate factor analyses were carried out for each 
of the groups involved. 

The results clearly substantiate the differentiation hypothesis 
with respect to each one of its three possible deductions: (a) con- 
tinuous decrease in the percentage of variance attributable to the 
general factor as age increases, (b) gradual increment in the per- 
centage of variance contributed by group factors with increasing 
age, and (c) gradual decline in the interdependence of factors in the 
older age groups. The implications of these findings for the possible 
modification of some of the principles of test construction and 
evaluation were pointed out. 


Kirk and McCarthy (1961a) have recently developed the Illinois 
Test of Psycholinguistic Abilities (ITPA) that was designed to 
identify psycholinguistic abilities and disabilities in children be- 
tween the ages of two and one-half and nine. The experimental 
edition currently being evaluated consists of nine subtests which 
have been selected to pictorially illustrate linguistic strengths and 
weaknesses in children. This test was based on a theoretical model 
of communication processes which had been adapted from Hull’s 
theory of learning by Charles Osgood (1957). 

The research conducted on this experimental edition to date has 
revealed that the instrument has some potential as a diagnostic 
tool with the various types of handicapped children studied. Kirk, 
Kass, and Bateman (1962), Sievers (1963), McCarthy (1963), Olson 
(1963), Bateman (1963), Semmel and Mueller (1962), Kass (1963), 
Smith (1962), and Johnson and Capobianco (1957) have been largely 
responsible for the research in this area, which has included 
cerebral palsied children, receptive aphasics, expressive aphasics, 
deaf children, partially sighted children, and mentally retarded 
children. 

The purpose of this study was to determine the effectiveness and 
stability of the ITPA with children who have above average 
intelligence. As indicated, this instrument has initially been 
positively received by many workers in the field. However, little 
is known about the effectiveness of this test among children in the 
upper range of the intelligence scale. It is assumed that, to be 
generally useful, such a test should be applicable over a broad 
rather than a limited range of intelligence. 


Subjects 


The subjects consisted of 92 children attending Northern Illinois 
University Laboratory and Nursery School who had demonstrated 
that their intelligence was above average through classroom per- 
formance and group intelligence tests that were routinely admin- 
istered in their school program. Students who had not been tested 
because of their newness to the program, or for other reasons, were 
given the Peabody Picture Vocabulary Test. Subjects who had in- 
telligence quotients above 110 were thus selected to be in the study. 
Table 1 gives additional reference data on the subjects. 


TABLE 1 
Reference Data on Subjects 

CA (mos.)    Number of Boys    Number of Girls 
                (N = 50)          (N = 42) 
48-59               8                 8 
60-71               7                 5 
72-83              12                12 
84-95              11                 8 
96-107             11                 9 
108-119             1                 0 

Total = 92 


In addition to the intelligence tests, each child was also given the 
Massachusetts Vision Screening Test and an audiometric sweep test 
to determine normalcy of vision and hearing. No child included in 
the study had any observable defects, and all were consequently 
assumed to be normal except for their above average intelligence. 


Procedure 


The standardized procedure of administration provided by the test 
authors, Kirk and McCarthy (1961b), was used to administer the 
tests. Each examiner was assigned a specific age group to examine 
in the research project to assure competency within a specific age 
range. All examiners had demonstrated their competency in giving 
this test. 

Each child included in the study was examined twice by the same 
examiner, with a two-week period between the two testings. 

The data were assembled in proper form and checked by the re- 
search director. Pearson product-moment correlations were com- 
puted between test-retest scores of the nine subtests and the Com- 
posite Language Age score. 
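
The stability coefficients reported here are ordinary Pearson product-moment correlations between first and second testings. A minimal sketch of that computation follows; the function name `pearson_r` and the scores are hypothetical illustrations, not the study's data or the authors' own procedure.

```python
import math


def pearson_r(first, second):
    """Pearson product-moment correlation between paired score lists."""
    if len(first) != len(second) or len(first) < 2:
        raise ValueError("need two score lists of equal length (>= 2)")
    n = len(first)
    mean_x = sum(first) / n
    mean_y = sum(second) / n
    # Sums of cross-products and squares of deviations from the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(first, second))
    sxx = sum((x - mean_x) ** 2 for x in first)
    syy = sum((y - mean_y) ** 2 for y in second)
    if sxx == 0 or syy == 0:
        raise ValueError("scores must vary on both occasions")
    return sxy / math.sqrt(sxx * syy)


# Hypothetical raw scores for six children on one subtest, tested twice.
test_scores = [12, 15, 9, 20, 17, 11]
retest_scores = [13, 14, 10, 19, 18, 12]
r_example = pearson_r(test_scores, retest_scores)
print(f"test-retest r = {r_example:.2f}")
```

In a study of this design the same function would be applied once per subtest and once to the composite score, each time pairing a child's first and second scores.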


Results and Discussion 


Test-retest correlations are shown in Table 2. It can be seen 
that the ITPA had rather high stability on all scales. The mean 
stability coefficient for the nine subtests was .82; the Language 
Age (composite score) had the highest coefficient, .97. 


TABLE 2 
Stability Coefficients of the Illinois Test of Psycholinguistic Abilities 

Subtests                               r 
1. Auditory Decoding                  .91 
2. Visual Decoding                    .70 
3. Auditory-Vocal Association         .88 
4. Visual-Motor Association           .83 
5. Vocal Encoding                     .80 
6. Motor Encoding                     .84 
7. Auditory-Vocal Automatic           .94 
8. Auditory-Vocal Sequencing          .80 
9. Visual-Motor Sequencing            .70 

Language Age (composite score)        .97 
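
As a quick arithmetic check, the mean of the nine subtest coefficients in Table 2 can be recomputed. The sketch below assumes the printed entries are to be read with their decimal points restored (for example, "70" as .70); the variable names are illustrative only.

```python
# Subtest stability coefficients as read from Table 2, in subtest order
# (assumption: entries printed without a decimal point are .70, .84, .80).
subtest_r = [0.91, 0.70, 0.88, 0.83, 0.80, 0.84, 0.94, 0.80, 0.70]

# Simple arithmetic mean of the nine coefficients.
mean_r = sum(subtest_r) / len(subtest_r)
print(f"mean stability coefficient = {mean_r:.2f}")
```

Under that reading the result agrees with the mean of .82 given in the text.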


Satisfactory levels of reliability, however, are dependent upon the 
type of decisions to be made. To evaluate differences between sub- 
tests for one individual will require higher levels of reliability than 
will decisions regarding differences between groups of children. 
Inevitably, the degree of accuracy required by an examiner will 
depend upon the decision he must make; consequently, the adequacy 
of this test must ultimately be left to the examiner. 

It should also be noted that the range in age and mental ability 
(although limited), as well as the short period between the 
original test and the retest, are all factors that have increased 
the coefficients reported in the study. Consequently, the 
coefficients listed in Table 2 should be considered as maximum 
rather than minimum values for this test with this type of 
exceptional child. 

In general the present study suggests that the Illinois Test of 
Psycholinguistic Abilities does have sufficient stability in its 
subtests to justify the use of this test as a diagnostic instrument 
with individual children who have above average intelligence. 


It would appear reasonable to expect that the recent emphases 
on and requirements for evaluation of exemplary and innovative 
elementary and secondary school programs that have been under- 
written by both federal agencies and private foundations may have 
some impact upon the evaluation of programs in higher education. 
Although much has been written during the past thirty years about 
the operational statements of educational objectives, the construc- 
tion of tests and scales around these objectives, and the implica- 
tions of the resulting measurable outcomes to problems of learning 
diagnosis, curriculum design, and modifications in instructional 
strategies, faculty members and administrators in colleges and uni- 
versities have for the most part given little systematic attention to 
the evaluation of their educational efforts perhaps because they 
have lacked the experience necessary to state their educational 
goals in a language form that is amenable to obtaining valid and 
reliable measures of the desired outcomes of the instructional proc- 
ess as well as of other relevant activities. The existence of many 
special educational offerings such as those involving foreign 
students or those painstakingly prepared by certain professional 
schools suggests the need for developing approaches to finding valid 
measures of behaviorally-stated objectives so that appropriate 
evaluations can be made of the effectiveness of these programs. 
Thus the central problem of this paper is to present a tentative 
paradigm that will show how valid measurable objectives may be 
developed for and utilized in the evaluation of the effectiveness of 
college programs, a paradigm that will permit the potential 
realization of improvements in the validity of admissions procedures 
and in the evaluation of student success in terms of a variety of 
observable outcomes. 

1 The contents of this article are based in part on an invited paper 
presented by the senior author to a meeting of directors of Foreign 
Student Programs and other administrative offices in student 
personnel services of the California State Colleges, which was held 
at California State Polytechnic College, Voorhis Center, San Dimas, 
October 16, 1966. 


Major Purposes and Assumptions Underlying Evaluation 


Prior to formulating a paradigm, an orientation to the basic 
purposes and assumptions underlying evaluation may be helpful. 
Although set forth twenty-five years ago, Tyler’s (1942) statements 
concerning the purposes and assumptions of evaluation constitute, 
in the writers’ opinion, one of the most direct and meaningful sets 
of statements that have been made to date. Other clarifying 
presentations have been those of Dressel et al. (1961), Lindvall 
(1964), and Fivars and Gosnell (1966). 

Tyler cited six major purposes of evaluation which the writers 
have paraphrased to be as follows: (1) furnishing a periodic check 
concerning how effective an institution is and at what point im- 
provements in its educational program are essential, (2) validating 
hypotheses relative to which an educational institution functions, 
(3) yielding information essential to making guidance of individual 
students effective, (4) affording a degree of psychological security 
to students, faculty members, and parents, (5) allowing a means 
for maintaining and improving public relations, and (6) assisting 
students and teachers to clarify their goals and to perceive con- 
cretely the directions which they are taking. Underlying evaluation 
are six major assumptions according to Tyler which have been re- 
formulated in the writers’ words as follows: (1) education is a proc- 
ess in which change of behavior patterns is an essential character- 
istic; (2) the kinds of changes and patterns of human behavior that 
the college (or school) seeks to effect represent its educational 
objectives; (3) the appraisal of educational programs is realized by 
ascertaining to what extent the objectives of the programs are being 
attained; (4) the manner in which a student organizes his patterns 
of behavior is an essential feature to be determined; (5) methods of 
evaluation embrace not only pencil-and-paper tests but also any 
other devices that furnish valid evidence concerning the degree to 
which students have attained educational objectives; and (6) max- 
imum values are to be gained in the evaluation process when teach- 
ers, students, and parents all participate. 


Major Steps in the Evaluation Process 


In the paradigm portrayed in Figure 1, the key steps in the proc- 
ess of evaluating educational programs are presented essentially in 
the temporal order that they would ordinarily assume, although 
some variation in the sequence is to be expected. These steps may 
be outlined and elucidated as follows: 


I. Statement of Broad Goals of the Educational Program as a Nec- 
essary Basis for Evaluation. 


First, it is necessary to set up the broad goals and general pur- 
poses of the educational program that reflect the philosophical 
values held by the students, faculty, and administration of the 
given school or college with respect to the desired outcomes of the 
educational experience—intellectual, aesthetic, social, and voca- 
tional. Such goals are related to the types of curricula emphasized 
by the educational institution, the unique economic and cultural 
needs of the geographical region, and societal expectations. Societal 
expectations reflect the judgmental positions and personal biases of 
both professional personnel (e.g., teachers and administrators) and 
lay individuals (e.g., legislators, members of boards of trustees, 
television commentators, and alumni groups) concerning the 
appropriateness of what they perceive to be the inputs for 
anticipated outcomes of the educational program. 

Figure 1. Flow Chart Showing Basic Steps in the Evaluation Process in Educational Programs 

The curriculum of a college affords a helpful example of areas of 
concern to both the professional and lay groups. In this domain at- 
tention would need to be given to (1) the relative emphasis to be 
placed on academic, professional, vocational, commercial and 
business, or other experiences; (2) opportunities for social inter- 
action, aesthetic experiences, work programs, and individual crea- 
tive growth; and (3) civic and family responsibilities. Preferably 
some sort of paradigm as represented by flow charts, diagrams, and 
organized listings of major parameters or factors should be made to 
show the pattern of their interrelationships as in a model portraying 
the basic input features of the educational program thought nec- 
essary to achieve certain desired outputs or outcomes. Orientation 
of the faculty and students to the basic goals of the educational 
program is necessary prior to the formulation of any concrete ob- 
jectives and subsequent to the evaluation of their attainment. If 
faculty members and students have contradictory or vague concep- 
tions concerning the underlying philosophy of the educational pro- 
gram and of their respective responsibilities and roles, specific be- 
havioral objectives may be difficult to formulate, and the validity of 
any outcome (criterion) measures may well be indeterminate. 


II. Development of Specific Behavioral Objectives. 


Once the broad goals of a program have been established, specific 
behaviorally-stated objectives which are consistent with the value 
system of the educational institution need to be formulated in terms 
of observable changes of behavior. As should be known to most 
readers, an objective is a description of an intended or planned 
change in the behavior of students—a statement of what the stu- 
dent is anticipated to be like when he has completed a given set of 
learning experiences or after he has exposed himself to certain de- 
finable activities in the campus environment. A statement of an 
objective includes (1) an identification of the intended behavioral 
change, (2) a detailed operational definition involving observable 
and measurable changes in behavior, that is, a definition consisting 
of activities that carry unambiguous meanings and allow wherever 
possible a numerical or quantitative expression of the behavioral 
changes measured, and (3) the establishment of standards or 
criteria of what are considered acceptable changes in the light of a 
consistent, meaningful, and workable philosophy for the educational 
institution. Furthermore, professional personnel (usually professors 
with the possible solicited help of a psychologist or evaluator 
from the campus testing center) need to determine which 
objectives are relevant and to set priorities concerning the relative 
importance and significance of the objectives to the program. 
Worthy of consideration are the amount of time and the extent of 
human and material resources available for the evaluation process, as 
in most practical educational situations only a limited number of 
objectives can be represented in most of the testing and 
observational devices employed. 


A. Criteria Which Objectives Usually Need to Meet 


Behaviorally-stated objectives usually need to meet the follow- 
ing criteria: (1) exhibit relevancy to the basic philosophical tenets 
and values of the educational institution and school community— 
that is, possess a high enough priority in light of the personal 
judgments of the professional individuals involved in evaluation to be 
included as matters of significant concern; (2) satisfy the immedi- 
ate as well as long-range needs of the intra-institutional and extra- 
institutional culture; (3) show compatibility with the students’ 
states of readiness or preparatory sets to learn; (4) reveal realistic 
correspondence to what can be effectively taught in terms of avail- 
able physical facilities, staff qualifications, library holdings, and 
fiscal capability; and (5) facilitate interaction of student and 
teacher by providing the basis for a continuous two-way feedback 
that will serve to motivate and direct learning. 


B. Examples of Broad Goals and of Specific Behavioral 
Objectives 


Examples to distinguish between broad goals or objectives on the 
one hand and specific behavioral objectives on the other may be 
helpful in grasping the difference in the meanings intended for the 
first and second blocks of Figure 1: 


Broad Goals or Objectives 


1. To develop an understanding of the nature of the scientific 
method. 

2. To write and speak correctly and effectively. 

3. To gain appreciation of the heritage of music and art of 
Western civilization. 


Specific Behavioral Objectives 


1. (a) To draw accurate inferences concerning possible cause 
and effect relationships from tabulated data on an ex- 
periment in photosynthesis; (b) to design an experi- 
ment intended to test the validity of Newton's second 
law; or (c) to predict the tensile strength of a piece of 
metal from a measure of its hardness. 

2. (a) To show accurate use of noun and verb forms; (b) to 
punctuate a paragraph correctly; (c) to compose a 
letter of job application correct in form and free of 
grammatical errors. 

3. (a) To attend five or more musical events (without com- 
pulsion) held on the college campus; (b) to read for 
pleasure three or more books during the school year on 
the history and theory of visual art; (c) to participate 
in the campus choir at least one evening per week with- 
out receiving unit credit. 


III. Translation or Transformation of Specific Behavioral Objec- 
tives Into a Form That is Applicable to Facilitating Learning 
(Inducing Behavioral Changes) in the Classroom or in the 
College Environment. 


Represented by the four parallel blocks arranged in a column 
within Figure 1, the totality of activities on the campus environ- 
ment is included, although the relatively formal process (e.g., teach- 
ing behaviors, demonstration of a laboratory experiment, presenta- 
tion of films, and group discussions) and content (information or 
subject matter communicated) characteristics of instructional strat- 
egies in the classroom are probably the most important or at least 
professionally the most acceptable manifestations of specific 
behavioral objectives toward which conscious efforts are being 
expended in the college environment. This third step in the 
evaluation model basically argues for the provision of a wide variety of 
formal and informal learning situations which will facilitate the 
realization of previously accepted and carefully stated behavioral 
objectives. 

From inspection of the four parallel blocks underlying the third 
phase of the evaluation model, the specific behaviorally-stated 
objectives should be expected to serve many of the following purposes: 
(1) establish bases for the measures employed in the selection and 
placement of college students (for if the objectives represented in 
the measures employed in the selection and placement of students 
match those displayed in the curriculum and emphasized in instruc- 
tional activities, it can be expected that the validity of those ad- 
missions procedures involving tests and scales can be substantially 
augmented); (2) serve as a guide to orienting and counselling 
students throughout their period of enrollment; (3) assist the college 
teacher in the development of a vital curriculum and in the 
selection, organization, and presentation of learning experiences, 
that is, in employing a variety of instructional strategies that will effect the 
attainment of the specific behaviorally-stated objectives; (4) aid 
instructors to assess student progress, to identify areas of relative 
strength and weakness in student performance, to diagnose stu- 
dents’ difficulties in learning, and to make adequate provisions for 
individual differences; (5) provide students a basis for self evalua- 
tion and self improvement as well as the opportunities for creative 
expression; (6) assist teachers in identifying the weaknesses and 
strengths in the procedures and products of their own instruction; 
(7) encourage development of supplementary and compatible ex- 
tracurricular experiences in which students and student groups may 
participate; (8) furnish a framework around which the public rela- 
tions of a school or college with the outside community can be facil- 
itated; and (9) provide the basis for the simultaneous evaluation of 
many dimensions of the effectiveness of the educational program— 
the identification of areas of instructional and service activities in 
need of improvement or modification. 


IV. Selection and/or Construction of a Variety of Instruments and 
Measures to Furnish Data Allowing Inferences to be Made 
Concerning the Extent to Which Specific Behaviorally-Stated 
Objectives Have Been Attained. 


The instrumentation requirement represents the fourth major 
phase of the evaluation model as portrayed in Figure 1. Evaluation 
can be no more accurate than the degree to which the measures 
employed will permit valid inferences to be drawn concerning changes 
in behavior and the significance of these changes in relation to the 
total educational program. 

Representative of some of the measuring devices are teacher- 
made objective and essay tests, standardized achievement tests, 
written assignments, direct observations of critical incidents, rating 
scales, and certain standardized aptitude and classification tests 
customarily administered by testing and counseling offices for pur- 
poses of admissions, placement, and guidance. Periodic revisions 
and refinements in locally-devised scales—especially in the instance 
of teacher-made tests and rating forms—are to be expected as in- 
structors wish to incorporate improvements from semester to se- 
mester on the basis of their own teaching experiences. An abundant 
literature exists concerning how tests and scales may be constructed 
or selected (e.g., see Adams (1964) or Thorndike and Hagen (1961)). 

Representative of other indices for assessing program objectives 
are such types of measures as frequency of absences, awards and 
special honors, grade point average, numbers and ratings of original 
products in the arts, articles accepted for publication in the college 
newspaper or in professional journals, extent of participation in 
school-sponsored social gatherings, number of books checked out of 
the library, hours spent in service activities, numbers of referrals to 
administrative offices, frequencies of peer nomination, and numbers of 
elective student body offices held (Metfessel, 1965). 


V. Periodic Observation and Administration of Tests and Scales. 


Base lines need to be established and sufficient data gathered to 
permit subsequent assessment of change in behaviors judged to be 
associated with specific objectives. 


VI. Determination of Behavioral Changes Relative to Specific 
Objectives. 


This sixth phase of the evaluation model pertains primarily to the 
analysis of data that are based wherever possible on objective 
measures. Such data analysis precedes formulation of any 
inferences, interpretations, or conclusions concerning what the data 
mean. Three fairly typical approaches to data analysis are 
concerned with: (1) making comparisons of the performances of 
individuals or of student groups with normative samples, steps 
involving the accumulation of descriptive data and the preparation of 
profile summaries; (2) determining the amounts of change relative 
to an earlier administration of the same instrument or of a 
comparable form of a test or scale, a procedure often involving finding 
the significance of differences between means or proportions as in 
critical ratio tests, analyses of variance, or analyses of covariance; 
and (3) finding intercorrelations among various measures that may 
suggest certain patterns of interrelationships, a phase often 
involving the calculation of criterion-related coefficients of correlation, 
the execution of multiple regression analyses, or the completion of 
factor analyses or discriminant function analyses. 
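
The second approach, testing a change between two administrations, can be illustrated with a critical ratio: the difference between two means divided by the standard error of that difference. The sketch below is one common textbook form for the same group tested twice, not a procedure prescribed by the paradigm; the function name, its parameters, and the numbers are hypothetical.

```python
import math


def critical_ratio(mean_pre, sd_pre, mean_post, sd_post, n, r=0.0):
    """Critical ratio for the change in mean between two administrations.

    Assumes the same n students were measured both times; r is the
    correlation between their two sets of scores (r = 0 treats the two
    sets as uncorrelated).
    """
    se_pre = sd_pre / math.sqrt(n)    # standard error of the first mean
    se_post = sd_post / math.sqrt(n)  # standard error of the second mean
    # Standard error of the difference between two correlated means.
    se_diff = math.sqrt(se_pre ** 2 + se_post ** 2 - 2 * r * se_pre * se_post)
    return (mean_post - mean_pre) / se_diff


# Hypothetical figures: 100 students, base-line mean 50, later mean 53,
# standard deviation 10 on both occasions, pre-post correlation .50.
cr = critical_ratio(50.0, 10.0, 53.0, 10.0, n=100, r=0.5)
print(f"critical ratio = {cr:.2f}")
```

The ratio is then referred to the normal curve; the larger the pre-post correlation, the smaller the standard error of the difference and the larger the ratio for a given gain.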


VII. Interpretation of the Data Measures Relative to Both Specific 
Behaviorally-Stated Objectives and Broad Goals. 


Perhaps the most difficult phase, but certainly the most important 
aspect of the evaluation process, is that of the interpretation of the 
data furnished by the measures. In other words, what kinds of in- 
ferences, implications, and conclusions of practical significance can 
be drawn concerning the amount of progress, the direction of 
growth, and the effectiveness of the total educational program both 
for groups of students and for the individual student? What do the 
findings really mean? How can values or judgmental standards be 
applied to each of the measures to determine how adequately each 
objective has been realized? In short, the validity or success of the 
educational program depends upon how closely the observed per- 
formance on each of the several outcome (criterion) measures 
meets or fits a previously valued (judgmental) standard of what is 
considered a minimal level of satisfactory performance. On the 
other hand, the validities of the specific measures or scales employed 
in the evaluation process are more likely to be concerned with how 
actual changes in behavior with respect to each of the measurable 
objectives conform to predicted changes that were derived from 
either preliminary achievement (base line) measures or from ini- 
tially-administered tests of scholastic aptitude or achievement 
thought to be prognostic of college success. Moreover, to answer the 
several questions just posed, sufficient time must be allowed to 
permit attainment of the educational objective specified, as the longer 
time span permits relatively greater generalizability of the per- 
manence of observed changes and of the impact upon the total edu- 
cational program. On the other hand, there is some risk that long- 
time segments may conceivably introduce marked confounding in 
the interpretation of results, as the outcomes, however permanent, 
may reflect numerous influences (other than classroom activities or 
other relatively well-controlled experiences) which may have been 
highly instrumental in the students’ having achieved the specific 
objectives set forth. 


VIII. Recommendations Leading to Further Implementation of the 
Objectives in the Educational Program or to Suggested Modi- 
fications in these Objectives. 


The feedback phase of evaluation that permits the determination 
of the relationship of observable outcomes to antecedent events such 
as instructional strategies, counselling experiences, or indices em- 
ployed in admissions decisions can suggest modifications that may 
be made either in the (specific as well as broad) objectives them- 
selves (which may have been unrealistic) or in the ways in which 
they can be more effectively attained if these objectives are still 
judged to be reasonably important. Thus the cycle of evaluation be- 
gins once again and follows the series of steps just outlined. Of 
course, within given semesters and for particular courses the 
perceptive professor will make his own modifications in his 
instructional approaches, in the curricular materials employed, and in his 
future examinations on the basis of the feedback he obtains from his 
own devices of evaluation. 


Some Specific Recommendations for Making the Process of Evalua- 
tion Effective. 


In the process of evaluation there are a few common-sense 
cautions which, if observed, may lead to the preservation and possible 
enhancement of the validity of measurable objectives and thus to 
tangible gains in what are judged to be important characteristics 
underlying the success of college and university programs. These 
recommendations include: (1) adequate time allowance for 
test-taking experiences; (2) clarification of directions for taking 
objective tests and for completing course assignments; (3) provision for 
open book examinations, field experiences, and independent study, 
activities that permit a widening of objectives to be sampled and 
measured; and (4) respect for the dignity and self-esteem of each 
student as a minimum requirement to sustain his motivation and 
interest. 


Summary 


A paradigm involving eight basic steps in the evaluation process 
has been presented. In the evaluation of educational programs at the 
college and university level emphasis has been placed upon the de- 
velopment and use of operationally-stated objectives which can be 
translated into meaningful learning experiences. Behavioral changes 
arising in association with formal and informal educational experi- 
ences in turn can be assessed through use of appropriate tests and 
measures that also have been anchored to the same objectives. On 
the basis of periodic observation and the administration of tests and 
scales over a sufficiently long time interval, behavioral changes may 
be determined analytically and interpreted. Careful study of the 
data may permit the drawing of relevant inferences and meaningful 
conclusions regarding the significance of the observed behaviors in 
relation to both specific and broad goals of the educational program. 
Once values and meanings have been established on the totality of 
collated measures and observations, recommendations can be made 
that will allow further implementation of the educational program 
as well as necessary modifications in the goals and objectives. 


SYMBOLIC TESTS AS PREDICTORS 
OF HIGH-SCHOOL GRADES 


MARY L. TENOPYR* 
North American Aviation, Inc. 
El Segundo, California 


It was the purpose of the investigator to determine the 
relationships between grades in selected ninth-grade subjects and scores on 
various tests, including experimental tests of symbolic memory 
(Tenopyr, 1966), established structure-of-intellect (SI) 
symbolic-cognition tests (Guilford and Hoepfner, 1966), the School and 
College Ability Tests (SCAT) (Educational Testing Service, 1957a), 
and some of the Sequential Tests of Educational Progress (STEP) 
(Educational Testing Service, 1957b). 

Method. Subjects (S's) comprised 115 male and 151 female tenth 
graders from a suburban high school in a large metropolitan area 
and were selected primarily on the basis of the availability of 
SCAT-STEP scores, which were in the school's records only for 
students who had attended a local school the previous year. The 
subjects’ standard deviations for the SCAT-STEP series were 
approximately equal to those reported for urban ninth graders 
(Educational Testing Service, 1963), whereas their means for these tests 
were generally about .5s higher than those of the publisher's norm 
group. 

Dependent variables were (a) freshman grade-point average 
(GPA), (b) world history grade, available for 219 S's assigned to 
the course on the basis of an achievement test, and (c) grade in the 
regular English class, to which 173 S's had been assigned as a result 
of teachers' ratings. 

Each symbolic test employed as an independent variable was de- 
veloped on the basis of the SI model (Guilford and Hoepfner, 1966) 
and was one found to be factorially univocal for the S's (Tenopyr, 
1966). Symbolic-cognition tests were as follows: Circle Reasoning, 
Correct Spelling, Number-Group Naming, Word Patterns, Word 
Relations, and Numerical Operations—Multiplication. The latter 
test was considered cognitive, although not presently classified as 
such in the SI system. The first five tests, in SI terms, represented 
cognition of symbolic systems, units, classes, implications, and
relations, respectively.

The following seven short symbolic-memory tests were used: 
Consonant Span II, Memory for Listed Nonsense Words, Memory 
for Order of Listed Numbers, Memory for Word Classes, Memory 
for Word Transformations, Number-Letter Association, and Simi- 
lar Word Changes Cross-out. The span memory test represented a 
relatively specific factor, whereas the other six tests measured, in 
SI nomenclature, memory for symbolic units, systems, classes, 
transformations, implications, and relations, respectively. Reliabil- 
ity estimates for the symbolic tests ranged from .55 to .90. 

The published tests used were SCAT—Verbal, SCAT—Quantita- 
tive, STEP—Mathematics, STEP—Reading, and STEP—Writing. 

Twelve multiple regression analyses were carried out, one for each
pairing of the three dependent variables with the following four
combinations of independent variables: all tests, symbolic-cognition
tests only, symbolic-memory tests only, and SCAT-STEP variables only.
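
As an illustration of how a multiple R of this kind depends on the
intercorrelations among the predictors, the following minimal sketch
computes R from a predictor intercorrelation matrix and a vector of
validities. The function names and the numerical values are
hypothetical and are not taken from the study; the sketch assumes
standardized variables, so that the regression weights solve the normal
equations and R squared is the sum of weight times validity.

```python
import math

def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x

def multiple_r(r_xx, r_xy):
    """Return (R, beta) for standardized predictors.

    r_xx: predictor intercorrelation matrix (list of lists).
    r_xy: correlations of each predictor with the criterion.
    """
    beta = solve(r_xx, r_xy)                         # standardized weights
    r_squared = sum(b * r for b, r in zip(beta, r_xy))
    return math.sqrt(r_squared), beta

# Hypothetical example: two predictors, each correlating .40 with the criterion.
r_low, _ = multiple_r([[1.0, 0.2], [0.2, 1.0]], [0.4, 0.4])   # weakly intercorrelated
r_high, _ = multiple_r([[1.0, 0.8], [0.8, 1.0]], [0.4, 0.4])  # highly intercorrelated
```

With equal validities, the weakly intercorrelated pair yields the larger
R, which is the general pattern appealed to when tests that are only
moderately correlated among themselves combine into a high multiple
correlation.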

Results. Despite generally lower individual coefficients of
correlation with the dependent variables than the SCAT-STEP variables
had (Tables 1 through 3), the symbolic tests were only moderately
correlated among themselves and thereby provided R’s in some cases
as high as those for the SCAT-STEP tests (Table 4). R’s for all
combinations of variables were highly significant (p < .001), and for
each of the three dependent variables, the R involving all tests was
significantly (p < .05) higher than those for the smaller
combinations of predictors. The lowest R’s, although moderate in
magnitude, were found when the symbolic-memory tests alone were
employed as independent variables. The R’s involving the
symbolic-cognition tests as the only predictors were comparable to
those found when the SCAT-STEP variables were used alone as
independent variables.

TABLE 4
Multiple Correlations for Tenth-Grade Students

                                              Criterion
                                               World History   English
                                    GPA            Grade        Grade
Independent Variables            (N = 266)       (N = 219)    (N = 173)
All tests                           .76             .69          .68
Symbolic-Cognition tests only       .70             .59          .57
Symbolic-Memory tests only          .57             .52          .45
SCAT-STEP tests only                .70             .60          .60

In general, the R’s for the smaller combinations of variables were 
achieved without resort to suppressor variables. Of the 54 regression 
coefficients computed for the subsets of independent variables, only
five were negative. Suppression effects were more prominent in the 
three analyses in which all tests were involved; in these, twelve of 
54 regression coefficients were negative. 

In each of the three analyses employing all independent varia- 
bles, at least some of the symbolic-cognition or symbolic-memory 
tests had significant, positive partial R’s with the dependent varia- 
ble. 

Discussion. The fact that a battery of short symbolic-cognition 
tests, not specifically designed for predicting grades, was as valid as 
a comprehensive battery specifically developed for measuring and
predicting academic achievement and the finding that symbolic 
tests could significantly augment the prediction of grades afforded 
by the longer battery raise some questions relative to typical meth- 
ods used in predicting scholastic achievement. 

It would appear likely that if curricula were analyzed carefully, 
relatively short and independent aptitude tests could be written or 
chosen so that academic achievement could be predicted with far 
more accuracy than is possible when traditional methods are em- 
ployed. The use of short factor-pure tests could probably be useful 
in differential prediction of grades. 

Although one symbolic-memory test had a significant positive 
partial R with GPA when all independent variables were utilized, 
and symbolic-memory tests in optimal combination afforded a
moderate level of prediction of grades, it does not appear that 
symbolic-memory tests, in general, can contribute so much as 
other types of test to improving prediction of scholastic success. 

CONCURRENT VALIDATION OF GUILFORD'S SIX
CONVERGENT TESTS 


GEORGE WINDHOLZ 
University of North Carolina at Charlotte 
AND 
WILLIAM A. McINTOSH 
North Carolina State University at Raleigh 


The objective of this study was to determine the concurrent validity
of a number of convergent tests and their composite, semantic
in content, as developed by Guilford and associates, within the 
framework of the Structure of Intellect model (Guilford and Hoepf- 
ner, 1963). 

Guilford's three dimensional model, Structure of Intellect, hy- 
pothesizes 120 independent factors of intellect. Each factor is de- 
fined in terms of three categories: operation, content, and product. 
For the purpose of this study, the selected factors are all semantic 
in content, and are subjected to two operations—divergent and con- 
vergent. Divergent operation is defined as “generation of informa- 
tion from given information, where the emphasis is upon variety 
and quantity of output from the same source,” while convergent 
operation is defined as “generation of information from given infor- 
mation, where the emphasis is upon achieving unique or conven- 
tionally accepted best outcomes.” (Guilford and Hoepfner, 1963, 
p. 2). Furthermore, each of Guilford’s factors yields a product, a re- 
sult of processed information, and the intellect model specifies six 
products. 

This study used twelve independent factors which were semantic 
in content. Six of these factors were divergent in operation and six 
were convergent in operation. 

Guilford’s definition of convergent operation specifies the generation
of a unique or best outcome from given information. This
skill is required by items included in the conventional measures of 
"intelligence" or general scholastic aptitude. On the other hand, 
Guilford and Merrifield (1960) point out that divergent operations 
are an essential element in creative thinking. Since Guilford’s tests
are intended to be independent, it follows that the correlation be- 
tween semantic convergent tests and the verbal part of a validated 
test of general scholastic aptitude probably would be larger than the 
correlation between the semantic divergent test and the verbal part 
of a validated test of general scholastic aptitude. Furthermore, 
since Guilford makes a distinction in terms of the content of a test, 
it follows that the correlation of convergent tests, semantic in con- 
tent, should be expected to be greater with the verbal part of a test 
of scholastic aptitude than with the quantitative part. 

For the purpose of the concurrent validation of Guilford’s tests, 
the Verbal (V-SAT) and Quantitative (Q-SAT) scores of the 
Scholastic Aptitude Test were used. 

The following hypotheses were postulated: 


Hypothesis 1. The correlation of each of the six convergent 
tests with the V-SAT score will be greater than the correlation 
of each of the six divergent tests with the V-SAT score. 
Hypothesis 2. The correlation of a composite of the six con- 
vergent tests with the V-SAT score will be greater than the cor- 
relation of a composite of the six divergent tests with the V- 
SAT score. 

Hypothesis 3. The correlation of each of the six convergent 
tests with the V-SAT score will be greater than the correlation 
of each of the six convergent tests with the Q-SAT score. 
Hypothesis 4. The correlation of the composite of the six
convergent tests with the V-SAT score will be greater than the
correlation of the same composite with the Q-SAT score.


Subjects and Procedure. The sample for this study consisted of 
165 male and female undergraduate college students enrolled in an 
introductory course of psychology at the University of North Caro- 
lina at Charlotte during the 1964-1965 academic year. 

Administration of the 12 convergent and divergent tests, developed
by Guilford, was conducted in the classroom during two sessions.
The procedure for administration adhered strictly to the
guidelines specified by the test constructors. The order of
administration was as follows:


First Session:
Factor DMC Test: Alternate Uses
Factor NMC Test: Word Grouping
Factor DMI Test: Possible Jobs
Factor NMI Test: Sequential Association
Factor DMS Test: Expressional Fluency
Factor NMS Test: Picture Arrangement


Second Session:
Factor DMU Test: Ideational Fluency I
Factor NMU Test: Picture-Group Naming
Factor DMT Test: Plot Titles (clever)
Factor NMT Test: Gestalt Transformation
Factor DMR Test: Associational Fluency I
Factor NMR Test: Inventive Opposites


All tests were scored by one individual in accordance with the
scoring procedure outlined by the test constructor. The reliability of
each test was determined by correlating (Pearson's r) the alternate
forms of each test, with the exception of the Picture-Group Naming
test, where the reliability was estimated by the split-half technique.
The estimates were corrected by the Spearman-Brown formula.
Reliabilities for the divergent and convergent composites were
estimated by the Mosier formula (Mosier, 1943).
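
The Spearman-Brown correction mentioned above can be sketched as
follows. The function name and the example value are illustrative
assumptions; the formula itself is the standard one, which steps up the
correlation between two parallel parts to the reliability of a test k
times as long (k = 2 for two halves or two alternate forms combined).

```python
def spearman_brown(r_part, k=2.0):
    """Estimated reliability of a test lengthened k-fold, given the
    correlation r_part between parallel parts of the original length."""
    return k * r_part / (1.0 + (k - 1.0) * r_part)

# Hypothetical example: two halves correlating .50 with each other.
corrected = spearman_brown(0.50)
```

For two halves correlating .50, the corrected estimate is two thirds,
which shows how sharply the correction can raise a modest part
correlation.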

The computation of composite scores for the divergent and
convergent tests proceeded as follows: the scores of each test were
subjected to non-linear transformation into percentile scores and then
transformed into T-scores. For each sample member, the T-scores
of the six divergent tests were summed to obtain the Divergent
Composite score. The same procedure was used to obtain the
Convergent Composite score.
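
The composite computation described above might be sketched as follows.
The source does not state how percentile ranks were defined or how ties
were handled; the midpoint convention used here, the function names, and
the sample scores are all assumptions made for illustration.

```python
from statistics import NormalDist

def normalized_t_scores(raw):
    """Convert raw scores to normalized T-scores (mean 50, SD 10).

    Each score is first given a midpoint percentile rank (an assumed
    convention), which is then mapped through the inverse normal curve.
    """
    n = len(raw)
    out = []
    for x in raw:
        below = sum(1 for v in raw if v < x)
        ties = sum(1 for v in raw if v == x)
        pr = (below + 0.5 * ties) / n          # always strictly between 0 and 1
        out.append(50.0 + 10.0 * NormalDist().inv_cdf(pr))
    return out

def composite(tests):
    """Sum the T-scores across tests for each person.

    tests: list of raw-score lists, one per test, persons in the same order.
    """
    t_by_test = [normalized_t_scores(scores) for scores in tests]
    return [sum(person) for person in zip(*t_by_test)]

# Hypothetical raw scores for three persons on two tests.
t_first = normalized_t_scores([3, 1, 2])
totals = composite([[3, 1, 2], [1, 3, 2]])
```

Because every test is placed on the same T-score scale before summing,
each test contributes equally to the composite regardless of its raw
score range.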

The Verbal and Quantitative scores of the Scholastic Aptitude 
Test were obtained from the sample members’ college admission 
records. 

Results. As indicated in Table 1, the reliabilities for the divergent
tests ranged from .45 to .73 with their composite showing a
reliability coefficient of .79.

TABLE 1

Reliabilities of Six Divergent Tests and the Divergent Composite and of Six
Convergent Tests and the Convergent Composite (N = 165)

Divergent                           Convergent
Test                       r_tt     Test                       r_tt
Alternate Uses             .67a     Word Grouping              .54a
Possible Jobs              .60a     Sequential Association     .46a
Expressional Fluency       .67a     Picture Arrangement        .51a
Ideational Fluency I       .73a     Picture-Group Naming       .64b
Plot Titles (clever)       .45a     Gestalt Transformation     .67a
Associational Fluency I    .47a     Inventive Opposites        .82a
Divergent Composite        .79c     Convergent Composite       .80c

Note. r_tt corrected by Spearman-Brown formula.
a Alternate form estimate.
b Split-half estimate.
c Computed with Mosier formula.


These estimates are slightly lower than the reliabilities of the
convergent tests, which ranged from .46 to .82, with a composite
reliability coefficient of .80. Although the reliabilities of the
separate tests are relatively low, the reliabilities of the composites
are rather high.

The matrices of intercorrelations among the convergent and the
divergent tests are presented in Tables 2 and 3.

The correlation coefficients indicate that the six tests of each opera- 
tion do, to some extent, measure the same factors. 

On the other hand, as indicated in Table 4, the intercorrelations 
between convergent and divergent tests are largely non-significant. 
Thus, it may be concluded that, to a large extent, the divergent
and convergent tests are measuring independent factors. 

TABLE 2

Intercorrelations among Six Convergent Tests and Their
Convergent Composite (N = 165)

Test                          2      3      4      5      6      7
1. Word Grouping             .06    .03    [?]    [?]    .23*   .54*
2. Sequential Association           .24*   [?]    [?]    [?]    [?]
3. Picture Arrangement                     .05    .18*   .24*   .56*
4. Picture-Group Naming                           .95*   .45*   .62*
5. Gestalt Transformation                                .21*   .53*
6. Inventive Opposites                                          .67*
7. Convergent Composite

* Significant at .05 level.


TABLE 3

Intercorrelations among Six Divergent Tests and Their
Divergent Composite (N = 165)

Test                          2      3      4      5      6      7
1. Alternate Uses           -.09    .31*   .44*   .03    .20*   .55*
2. Possible Jobs                    .08    .15    .26*   .10    .38*
3. Expressional Fluency                    .18*   .08    .30*   [?]
4. Ideational Fluency I                           .12    [?]    [?]
5. Plot Titles (clever)                                  .44*   .50*
6. Associational Fluency I                                      .66*
7. Divergent Composite

* Significant at .05 level.


As shown in Table 5, the correlation coefficient between the
Convergent Composite and the Divergent Composite is .24, a
significant but nevertheless small value, indicating that the
convergent and divergent tests, semantic in content, measure somewhat
different abilities.

As far as the hypotheses are concerned, Hypothesis 1 was par- 
tially supported as only two out of the six correlations between con- 
vergent tests and the V-SAT score are significantly larger than the 
correlation of the corresponding divergent tests with the V-SAT 
score as indicated in Table 6. 

Hypothesis 2 was untenable, as the correlation between the Con- 
vergent Composite with the V-SAT score did not differ significantly 
from the correlation of the Divergent Composite with the V-SAT 
score (See Table 6). 

Hypothesis 3 stated that the correlation of each of the six con- 
vergent tests would be higher with the V-SAT score than with the 
Q-SAT score. The results given in Table 7 show that this predicted 
relationship held only in two out of the six correlations. Thus, this 
hypothesis was only partially supported. 

Hypothesis 4, which postulated that the correlation of the Convergent
Composite with the V-SAT score would be higher than with the
Q-SAT score, was supported, as the obtained t of 2.17 is significant
at the .05 level, as indicated in Table 7.
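
The source does not name the statistic used to compare two correlations
that share a variable. One common choice for this situation is
Hotelling's t for dependent correlations, and the sketch below assumes
it; the function name is hypothetical. The inputs are the rounded
coefficients reported in Table 5 for the Convergent Composite with
V-SAT (.45), the Convergent Composite with Q-SAT (.28), and V-SAT with
Q-SAT (.29), with N = 165.

```python
import math

def hotelling_t(r12, r13, r23, n):
    """Hotelling's t (df = n - 3) for the difference between r12 and r13,
    two correlations from one sample that share variable 1.
    r23 is the correlation between the two non-shared variables."""
    det = 1.0 - r12**2 - r13**2 - r23**2 + 2.0 * r12 * r13 * r23
    return (r12 - r13) * math.sqrt((n - 3) * (1.0 + r23) / (2.0 * det))

# Rounded coefficients from Table 5; the choice of test is an assumption.
t_composite = hotelling_t(0.45, 0.28, 0.29, 165)
```

Under these assumptions the statistic comes to about 2.07, close to but
not identical with the reported 2.17; the gap may reflect rounding of
the published coefficients or the use of a different formula.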

TABLE 5

Intercorrelations among the Convergent Composite, Divergent
Composite, and the Verbal and Quantitative Parts of
the Scholastic Aptitude Test (N = 165)

Variable                      2      3      4
1. Convergent Composite      .24*   .45*   .28*
2. Divergent Composite              .31*   .02
3. V-SAT                                   .29*
4. Q-SAT

* Significant at .05 level.


TABLE 6

Significance of the Difference between the Correlation of Each of Six
Convergent Tests and of Their Composite with V-SAT Score
and the Correlation of Each of Six Divergent Tests and of
Their Composite with V-SAT Score (N = 165)

Factor   Test                         V-SAT      t
DMC      Alternate Uses                .06     1.97*
NMC      Word Grouping                 .20
DMI      Possible Jobs                 .03     1.04
NMI      Sequential Association        .15
DMS      Expressional Fluency          .24      .49
NMS      Picture Arrangement           .28
DMU      Ideational Fluency I          .10     1.12
NMU      Picture-Group Naming          .21
DMT      Plot Titles (clever)          .10      .57
NMT      Gestalt Transformation        .04
DMR      Associational Fluency I       .52     3.31*
NMR      Inventive Opposites           .68
         Divergent Composite           .31     1.55
         Convergent Composite          .45

* Significant at .05 level.


TABLE 7

Significance of the Difference between the Correlation Coefficient of Each of Six
Convergent Tests and of Their Composite with the V-SAT Score and
of Corresponding Coefficients with Q-SAT Score (N = 165)

Factor   Test                         V-SAT    Q-SAT      t
NMC      Word Grouping                 .20      .30     1.13
NMI      Sequential Association        .15     -.14     3.33*
NMS      Picture Arrangement           .28      .97     1.08
NMU      Picture-Group Naming          .21      .07     1.51
NMT      Gestalt Transformation        .04      .16     1.34
NMR      Inventive Opposites           .68      .24     6.59*
         Convergent Composite          .45      .28     2.17*

* Significant at .05 level.


Discussion. The main objective of this study was the concurrent
validation of some convergent tests as developed by Guilford and
associates. The results indicated that the correlation of divergent and
convergent composites of .24, although significant, is so low that it
may be concluded that the composites probably measure somewhat
different aptitudes. But it was also found that the convergent
ability, semantic in content, should not be equated with intelligence, or
general scholastic aptitude. Lorge and Thorndike (1962) reported
that the correlation between the Iowa Test of Educational
Development composite and the verbal part of the Lorge-Thorndike
Intelligence Test was .88. Usually, correlation coefficients between
well validated intelligence tests are in the .70-.80 range, and thus it
must be concluded that a correlation between the verbal part of the
Scholastic Aptitude Test and the Convergent Composite of .45 implies
that the six convergent tests from the Structure of Intellect measure
aptitudes that have but little relationship with the aptitudes
measured by the conventional tests of general scholastic aptitude.
On the other hand, the assumption that the content of Guilford’s
tests determines the sampling of aptitudes is justified. Although the
correlation of the Convergent Composite is significant with both the
Verbal and Quantitative parts of the Scholastic Aptitude Test, the
t between the two coefficients is 2.17. This significant difference in
the expected direction signifies that the semantic tests from
Guilford's model have much more in common with the verbal part of the
SAT than with its quantitative part.

Summary. The object of this study was the concurrent validation
of six convergent tests and their composite, which were semantic
in content. The subjects were 165 male and female college
undergraduates. It was found that although there was a significant
relationship between the composite of convergent tests and the
V-SAT score, the convergent tests measured somewhat different
abilities from those measured by conventional tests of general
scholastic aptitude.


ifornia Psychology Laboratory Report, No. 24, 1960. 
Lorge, I. and Thorndike, R. L. The Lorge-Thorndike Intelligence
in, 1962. 


Mosier, C. I. On the Reliability of a Weighted Composite. Psycho- 

PREDICTING COLLEGE GRADES USING ACT DATA


LEO MUNDAY 
American College Testing Program 
Iowa City, Iowa


Problem. The purpose of this study was to review the experience 
of ACT Research Services, with respect to the prediction of college 
grades. 


ACT Research Services: The American College Testing Program 
offers several research services at no cost to member colleges and 
universities. This study is based upon results from one of the two 
prediction services. In the Standard (Plan A) Research Service, 
ACT scores, high school grades, and, optionally, locally collected 
scores are studied as predictors of grades in specific college courses 
as well as of overall college grade point average (GPA). Colleges 
may study the various predictors and criteria for as many as nine 
different subgroups corresponding to academic departments, stu- 
dent residence, high school attended, or any other subgrouping 
design desired by the college. In one subgroup, or by combining 
subgroups, a college develops a Summary Analysis representative 
of its freshman class. 

Each year several hundred colleges and universities participate in 
these Research Services. Though the prediction of college grades is 
not the most important outcome of these reports, prediction is an 
important part of these services. Regression coefficients, grade ex- 
pectancies, and percentile ranks of predicted grades, developed by 
ACT as a part of specific colleges’ Research Service participation, 
are stored on magnetic tape and are used to provide predictive
information on score reports for subsequent ACT-tested students who
request that their scores be sent to these colleges.


1 Another ACT Research Service also provides predictive information. The
Basic Plan B Research Service provides analyses only for overall GPA.

Description of predictors. Two sets of predictors were used, ACT 
test scores and recent high school grades. The four ACT tests are 
designed to measure general educational development in the subject 
matter areas of English, Mathematics, Social Studies, and Natural 
Science. The tests, which are intended as indices of academic poten- 
tial for college-bound students, are used by high school and college 
educators for guidance, college admissions, scholarship selection, 
placement in college courses, and educational and vocational
counseling.

Four self-reported high school grades in the areas covered by the 
ACT tests are routinely collected at the time a student writes the 
ACT examination. The student is asked to report his most recent 
grade prior to his senior year in high school in each of the four
subject areas: English, Mathematics, Social Studies, and Natural
Science. Students have been found to report these grades with a high
degree of accuracy (American College Testing Program, 1965). 

The T Correlation (T Index) is the multiple R resulting from 
optimally weighting the four ACT tests. The H correlation (H Index)
is the multiple R derived from optimally weighting the four
high school grades. The TH correlation (TH Index) is developed by 
averaging the GPA predictions made by the optimal weighting of 
tests and those made by optimally weighting high school grades. 
The TH Index has been found to be similar to an eight-variable 
multiple regression equation, and is the index ACT recommends 
colleges use as their best estimate of the relationship between the 
ACT record and college grades on their campuses. 
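
The construction of the TH Index described above can be sketched as
follows: the GPA predicted from the test equation and the GPA predicted
from the high school grade equation are averaged for each student, and
the averaged prediction is correlated with earned GPA. The function
names and all the numbers are hypothetical illustrations, not ACT data,
and the two sets of predictions are taken as given rather than fitted
here.

```python
import math

def pearson_r(x, y):
    """Product-moment correlation between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def th_prediction(pred_t, pred_h):
    """Average the test-based and grade-based GPA predictions per student."""
    return [(t + h) / 2.0 for t, h in zip(pred_t, pred_h)]

# Hypothetical data for five students.
gpa = [1.5, 2.0, 2.5, 3.0, 3.5]        # earned college GPA
pred_t = [2.0, 1.8, 2.6, 2.7, 3.4]     # predictions from the ACT test equation
pred_h = [1.7, 2.4, 2.3, 3.2, 2.9]     # predictions from the high school grade equation

r_t = pearson_r(pred_t, gpa)                           # analogous to the T Index
r_h = pearson_r(pred_h, gpa)                           # analogous to the H Index
r_th = pearson_r(th_prediction(pred_t, pred_h), gpa)   # analogous to the TH Index
```

In this made-up example the averaged prediction correlates more highly
with earned GPA than either prediction alone, because the errors of the
two equations partly cancel.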

Criterion measures. Criterion measures were college grade point 
averages (GPA's) reported by the colleges and universities. Though 
colleges were free to select any specific courses they desired for 
study, most institutions chose course grades in English, Mathematics, 
Social Studies, Natural Sciences, and, of course, overall grade 
point average. Only these criteria were included for this review. 

Sample. The sample consisted of 398 colleges and universities 
that were in the ACT Research Service in 1963, 1964, or 1965. Most 
institutions were in the Research Service two years and a great 
number were in all three years. This general rule was followed: if a 
college was in the Research Service more than one of these three 
years, data for only the most recent year in which it participated
were included.

Statistics were based on colleges’ Summary Analyses. Colleges
were instructed to make their Summary Analyses representative of 
their freshman classes. This is particularly important because the 
predictions made on score reports for future ACT-tested students, 
based on Research Service Summary Analyses for the most part, re- 
quire that student samples adequately represent a college's freshman
class. ACT recommends that all freshmen be included, if possible,
and if not possible that a random sample of freshmen be
selected. There is every reason to believe that these samples are 
representative of their respective college freshman classes. 


TABLE 1

Means and Standard Deviations for ACT Scores, High School Grades,
and College Grades

                                        Means
                                  Mathe-    Social    Natural
Variables              English    matics    Studies   Science   Overall*
ACT Scores              19.0       19.6      20.5      20.6      20.1
High School Grades       2.70       2.44      2.80      2.54      2.62
College Grades           2.02       1.98      1.97      1.93      2.09

                                 Standard Deviations
ACT Scores               4.85       6.21      5.89      5.98      4.80
High School Grades        .87        .88      [?]       [?]       [?]
College Grades            .97       1.16      1.00      1.06       .81

Number for whom college grades were available
  Students            176,779     76,089   128,201   112,038   211,324
  Colleges                379        249       337       315       398

Number for whom all ACT scores and high school grades were available
  Students                                                      213,707
  Colleges                                                          399

* For ACT scores, this refers to the Composite. For high school grades, this refers to the average
of the 4 self-reported grades.

Colleges were required to have at least 100 complete student records
for a given criterion in order for analyses to be completed. The
typical sample size was 500 for overall grade point average. 

Results. Means and standard deviations for ACT scores, high 
school grades, and college grades are given in Table 1. The data for 
ACT scores suggest that these students are similar to all ACT col- 
lege-bound students. 

Median multiple correlation coefficients for the T Index, the H 
Index, and the TH Index, including R’s and standard errors of
estimate (SE-EST), are given in Table 2. Information is included for
the four specific courses and overall GPA. Both ACT scores and the 
self-reported high school grades possess useful predictive validity, 
and the increase in predictive efficiency shown by the TH Index 
shows that scores and high school grades supplement each other as 
indicators of academic potential. 
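
For a single college, the standard error of estimate is tied to the
multiple R by the usual relation SE = SD times the square root of
(1 - R squared), where SD is the standard deviation of the criterion.
The sketch below applies that relation; the function name is
hypothetical. The inputs are the overall GPA standard deviation from
Table 1 and the median overall TH-R from Table 2, used only as
plausible magnitudes. Medians taken separately across colleges need not
satisfy the relation, so the result is not expected to reproduce the
tabled median.

```python
import math

def standard_error_of_estimate(sd_criterion, multiple_r):
    """Standard deviation of prediction errors for a criterion with the
    given standard deviation, predicted with multiple correlation R."""
    return sd_criterion * math.sqrt(1.0 - multiple_r ** 2)

# Illustrative inputs only: SD of overall college GPA and median overall TH-R.
se_overall = standard_error_of_estimate(0.81, 0.627)
```

The relation makes plain why even a sizable R leaves a standard error
that is a large fraction of the criterion's standard deviation.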

TABLE 2

Median Multiple R’s and Standard Errors of Estimate for Predicting College
Grades, Using Tests (T Index), High School Grades (H Index),
and the Combination (TH Index)

                        Median Multiple R’s for Four Specific
                             College Courses and Overall
                                  Mathe-    Social    Natural
Index                  English    matics    Studies   Science   Overall
T Index                 .508       .420      .496      .486      .523
H Index                 .508       .441      .485      .494      .558
TH Index                .593       .521      .578      .575      .627

                   Median Standard Errors of Estimate for Four
                        Specific College Courses and Overall
T Index                 .793      1.020      .839      .896      .647
H Index                 .786      1.007      .843      .893      .632
TH Index                .729       .952      .782      .832      .582

Number of Students    176,779     76,039   128,201   112,638   211,324
Number of Colleges        379        249       337       315       398


A percentile rank frequency distribution of TH-R’s is given in
Table 3. Correlation coefficients are included for the four specific
courses and overall GPA. The considerable variability among the
TH-R’s, as shown by this table, implies that (1) predictive validity
should be established separately for each college and university,
and (2) validity should not be summarized as merely one correlation
coefficient, as this results in considerable loss of information.


REFERENCE 


American College Testing Program. ACT Technical Report. Iowa 
City, Iowa: Author, 1965. 

A COMPARISON OF THE CRITERION-RELATED
VALIDITIES OF THREE COLLEGE ENTRANCE 
EXAMINATIONS WITH DIFFERENT 
CONTENT EMPHASES 


WAYNE S. ZIMMERMAN
California State College, Los Angeles
AND
WILLIAM B. MICHAEL 
University of California, Santa Barbara 


In the state colleges of California both the Scholastic Aptitude 
Test (SAT) of the College Entrance Examination Board (CEEB) 
and the American College Test (ACT) have been used in conjunc- 
tion with measures of high school achievement as a basis for selec- 
tion of freshmen as well as certain transfer students. Although fa- 
miliar problems of restriction of range in talent of the population of 
state college students (who customarily place in about the top third 
of high school graduates with the largest proportions falling in the 
top ten to top thirty per cent of the high school student bodies), of 
the unreliability of the grade point average, and of the
heterogeneity of certain freshman student bodies (as in the instance of
relatively high proportions of students who are working either part- 
or full-time and are commuting long distances) have been associ- 
ated with relatively low validity coefficients of standardized apti- 
tude and achievement tests, it was thought informative to compare 
the criterion-related validity of three selection devices: (1) the 
ACT, consisting of four subtests in English (E), Mathematics (M), 
Social Studies (SS), and Natural Sciences (NS); (2) the SAT of 
the CEEB, containing both a Verbal (V) and Mathematics (M) 
section; and (3) a short form of the Verbal Comprehension (VC) 
subtest and the regular form of the General Reasoning (GR) subtest
of the Guilford-Zimmerman Aptitude Survey (GZAS) (the GR
having been administered with a curtailed time limit). Although the 
distinction is only relative and not clear-cut, these three sets of tests 
tend respectively to constitute (1) relatively lengthy measures of 
developed abilities in which language and mathematics skills as
well as other fairly specific components of achievement are reflected 
rather substantially, (2) lengthy measures of scholastic aptitude in 
which relatively general indicators of verbal and quantitative abil- 
ities are represented, and (3) short aptitude measures in which 
verbal and quantitative reasoning abilities are rather briefly sam- 
pled. 

Purpose. The two-fold purpose of this investigation was (1) to 
compare the coefficients of validity of each of the subtests of the 
ACT and SAT for a sample of freshmen entering the California 
State College, Los Angeles, (CSCLA) in the fall of 1962 and of 
subtests of the ACT, SAT, and GZAS for another sample of fresh- 
men enrolling at CSCLA in the fall of 1963 relative to the criterion 
of grade point average (GPA) earned in courses taken in each of 
twelve different departments of the college, and (2) to determine 
for the 1963 class the extent to which an optimally weighted com- 
bination of either one of two indices of high school achievement 
(either the GPA earned during the last three years of high school— 
excluding physical education and military science—or the number 
of recommended units of A or B work during the same last three 
years of high school) and subtests of the ACT, SAT or GZAS would 
exceed that of either one of the single high school indices of achieve- 
ment. It was hypothesized that (1) few if any substantial differ- 
ences would exist among the validity coefficients of any one of the 
subtests, and (2) the ACT, SAT, and GZAS measures in combina- 
tion with either one of the high school indices would yield higher 
coefficients of validity than would any single predictor—a high 
school achievement index or a subtest measure. 

Samples. For the 1962 freshman class who were given the option 
of submitting either the ACT or SAT Examination, subsamples 
were taken of all students who had completed either one of these 
scales and had pursued one or more courses in the twelve depart- 
ments enumerated in Table 1. In the instance of the 1963 freshman 
class arrangements were made during the freshman orientation in 
September to give the ACT or SAT to anyone who had not had the 
test previously and to administer during two scheduled testing
sessions the VC and GR subtests of the GZAS to all students who
appeared. The short form of the VC was administered in ten minutes
instead of the recommended 12 minutes, and the GR form was re- 
duced from 35 to 15 minutes in its administration. Of the 1037 stu- 
dents who participated in testing complete data were obtained for 
823. 

Statistical procedures. For both the 1962 and 1963 samples zero- 
order coefficients were found with respect to each of the subtests of 
the ACT and SAT, and in the instance of the 1963 sample, for the 
GZAS relative to each of the criterion variables cited in Table 1. 
For variable numbers of students in the pool of 1037 subjects in the 
1963 freshman class, zero-order coefficients of correlation were ob- 
tained between each one of the two measures of high school achieve- 
ment in turn and each of the 15 criterion variables cited in Table 2. 
From the pool of 837 subjects for whom complete data were avail- 
able zero-order coefficients of correlation were found for the VC and 
GR subtests of the GZAS with each of the criterion measures cited 
in Table 1. From the pool of 823 subjects for whom complete test 
data were available coefficients of multiple correlation were deter- 
mined between each of the criterion variables and selected com- 
binations of the subtests in the ACT, SAT, and GZAS scales. Zero- 
order coefficients were obtained through using the WDCORR 
program on the IBM 7094 computer of the Western Data Process- 
ing Center at UCLA, and multiple correlation coefficients were 
derived from the BIMD 6 program at the same location.¹

Findings. Although the detailed results are to be seen in Tables 1 
and 2, some of the major findings may be summarized as follows: 

1. From the entries in Table 1 it is apparent that none of the 
zero-order coefficients of correlation could be judged as high. Spe- 
cifically, it may be noted that for the mathematics and chemistry 
courses the SAT-M and ACT-M tended to be the most valid predictors,
and that for other courses no clear-cut superiority or consistency
among the relatively low predictive coefficients was present
for the various subtests of the ACT, SAT, or GZAS scales. 

2. From the coefficients reported in Table 2 the following major 
findings seem evident: 

a. The GPA earned in all high school subjects showed in every in- 


1 Grateful acknowledgment is made to Western Data Processing Center at 
UCLA for making available facilities and time for processing of the data re- 
ported. 
Procedure. Three groups of new freshmen were tested:

Group A: 70 volunteers above the 35th centile on the ACT; 44
completed the battery of tests.

Group B: 59 students in Counseling 101 between the 10th and
35th centile on the ACT; 40 completed all the tests.

Group C: 37 students in the Basic 88 curriculum in the 1st decile
of ACT; all completed the battery of tests.

There is evidence that the results were not substantially influ- 
enced by the attrition in the samples. 

The instruments used were:

1. The Structured-Objective Rorschach Test (SORT).
   Thirty different trait standard scores may be obtained by
   this test and 15 raw scores.

2. The Wechsler Adult Intelligence Scale (WAIS). Adult I,
   Verbal Scale only.
   Administered here as a group test, this instrument needs to
   be validated as a group test.

3. The American Home Scale.
   A socioeconomic scale. Five scores are obtained.

4. The American College Test (ACT).

5. Grade Point Average (GPA).
   Two GPA's were used: the GPA and the corrected GPA.
   The latter was calculated by eliminating sub-college remedial
   course grades and marginal academic college course
   grades (i.e., Art, Business Mathematics).

To treat these data a null hypothesis was formed that the differences
in mean scores on measures of personality, socioeconomic
background, intelligence, and GPA in these groups were attributable
to random errors. The three groups were compared through the use
of nonparametric tests and the t-test.

A second treatment of these data was the calculation of two multiple
regression equations. Each variable was correlated with all the
other variables, and both the GPA and the corrected GPA were used as
the criteria for the two regression equations. Group C was eliminated
from this operation, since their GPA’s were not comparable
with those of Group A and Group B. Group A and Group B data
were combined.

Results—Part I. 1. Group A and Group B, Group B and Group C, 
and Group A and Group C were compared with respect to mean 
scores on five parts of the American Home Scale through use of the
Kolmogorov-Smirnov test. No significant difference was found among
the three groups on the five measures of socioeconomic status of this
scale. It appears from these data that these three groups came from
very similar socioeconomic backgrounds.

2. SORT standard scores of Group A and Group B on 30 traits were 
compared through using the Kolmogorov-Smirnov test. Thirty dif- 
ferent calculations were made using the standard scores. 

The only three traits that appeared to show significant differ- 
ences are: 


SORT Trait                Significance    Highest Mean
Dd. Pedantic              .001            Group B
H. Human Relationship     .05             Group B
O:P Conformity            .01             Group A


3. No significant differences were found between Group B and
Group C SORT standard scores when the Kolmogorov-Smirnov test
was employed. However, when Group A and Group C were tested,
a significant difference at the .05 level was found on trait O:P Conformity,
with Group A having the higher mean.

4. For the SORT variables, tests of significance were calculated 
for underachievers and overachievers. An underachiever for this 
step was a member of Group A who earned less than a “C” average,
and an overachiever was an individual in Group B who earned
more than a “C” average.

Fifteen subjects of Group A with the lowest GPA’s were tested 
against 15 subjects of Group B with the highest GPA’s using the 
Mann-Whitney U-test. Fifteen SORT raw score traits were compared.
The following traits showed significant differences in the
two groups.

SORT Trait          Significance Level    Higher Mean
W. Theoretical      .05                   Underachievers
Dd. Pedantic        .05                   Underachievers
O:P Conformity      .05                   Overachievers

SORT traits Dd Pedantic and O:P Conformity as in the second 
result again appear to be relevant factors in academic perform- 
ances. 
5. Coefficients of correlation were calculated for all the variables
—each one against the remaining variables. Two sets of calculations 
were made on the IBM 1401-7074 combination at the Chicago 
Board of Education Data Processing Bureau. 

These 2925 resultant r's were analyzed, and the most significant
r's relevant to this study were abstracted and are summarized
below.

a. The r’s of the ACT subtests and the GPA’s ranged from .22
   to .36. (See Table 1.) This range of coefficients suggests that
   the ACT as a predictor of uncorrected GPA’s in the population
   under consideration was of limited utility.

b. The r's of the ACT subtests and the corrected GPA’s
   ranged from .32 to .45. (See Table 1.) Social Studies exhibited
   the highest r of .45 with corrected GPA. In this
   investigation ACT Social Studies appeared to be a more
   valid predictor of corrected GPA than any of the other ACT
   subtests.

c. The r's of the WAIS subtests and the GPA’s ranged from .28
   to .51. The subtests of WAIS were not so predictive of GPA
   as the total Verbal WAIS appeared to be. The Total Verbal
   Scale had the highest r, .51, with GPA. (See Table 1.)

d. The r's of the WAIS subtests and the corrected GPA’s 




TABLE 1 
Intercorrelation of WAIS Raw Scores and ACT Standard Scores and GPA (N = 84) 
ACT Standard Scores 
WAIS Soc. Nat. Com- Corr. 
Raw Scores English Math Studies Sci. posite GPA GPA 

Information. 482^ — .460* — .430* — .415* — .512* t yx 
Comprehension Бе — 1240» — 449 — lages Мз» 2268, 303%% 
Arithmetic 13» — .021* — .399* — .421* — .500* — .333** 26% 
Similarities | .524* — .460* — .5615* — .370* — 1576 .449* 370* 
Digit Span -406*  .358* — .315**  .293** 301% "ово ‘360% 
Vocabulary -573* — .809* — .502* — .443* 560 378 `406* 
"Total Verbal 049* — .619*  .6017* — .544* — .700* бп» 4g5* — 
Total IQ 670% — .600*  .620* — 1546" — 714^ "лок `471* 
GP. 306* .357* 347** 
TABLE 2
Intercorrelation of WAIS Scaled Scores and ACT Standard Scores and GPA's (N = 84) 
ACT Standard Scores 
WAIS Soc. Nat. Com- Corr. 
Scaled Scores English Math Studies Sci. posite GPA
Information .478* .405* .443* .415* .509* .494** 
.531° .520° .497* .465* „541° 
Comprehension .529* .228» .434* .308** .425* .293* E 
.631° .273° .535* .379° .497* 
Arithmetic .236> .626*  .394* .414**  .498*  .922^ . 
.281* ‚751° .486° ‚511° 583° 
Similarities .506*  .474* .507* .365* .570* .435*  . 
M 574° .540° ‚59 . .440° ‚627° 
Digit Span /499*  .339*  .310**  .315**  .399* .980*  . 
.540° „406 .401° „408° ‚478° 
Vocabulary .553* | .883* .511* .A44* .558* .9700* —. 
.605 „491 .576° .500* 587° 
Total Verbal .649* .013*  .017* .544* .706* 161196708 
Total IQ .670* .650* .626* .546* „714* .490**  . 
.125* -715° .697• .603* .748° 
GPA .298* .300* .357* 224» .347** 
Corr. GPA .323* 344% .454* .941** .427*%* 


* .001 level of significance.
** .01 level of significance.
a .02 level of significance.
b .05 level of significance.
c Corrected for attenuation.




ranged from .27 to .49. (See Table 1.) The Total Verbal
Scale showed the highest r of .49, with corrected GPA.


e. The r's between WAIS and ACT are given in Tables 1 and 2.
   It can be said tentatively that WAIS used as a group test
   related significantly to the ACT. The correlation between
   WAIS IQ and the ACT subtests ranged from .60 to .75. The
   r between WAIS IQ and the ACT composite score was
   found to be .75.


f. The SORT attributes' correlations with the two GPA's were
very small, and the significant r's varied from 28 to 88. 
(See Tables 3 and 4.) Anxiety registered a negative correla- 
tion with GPA. All the other r's of SORT scores with GPA's 
were positive. 

g. Group C (Basic 88's) WAIS scores were so uniformly low
that it was decided to exclude them from further statistical 
consideration. Further, in a follow-up of this group after 


TABLE 3 
Intercorrelations of SORT, WAIS Verbal and ACT (N = 84) 


—щ—ж о | 


Wechsler SS 
s . ACT 88 
Comp Arith Sim Digit Voc Total Тош 
Sort 88 1 2 3 4 5 6 vs 19 Eng Маһ 83 NS Comp GPA GPA E 
1. Theoretical w 285° 206 173 эт on 186 ap 251» M6 323» — 083 14 86 лоз 053 а 
2. Practical D —098 072 026 020 ля — 004 050 027 °з 066 M9 081 мо —.061 — 026 á 
3. Pedantie Dd —.327** —.108 —.208  —362 -075  —312** —310** —308** —322^ —.202 -150 -—-307* -31 X —118 —409 5 
4. Induction 200  347** 220 — 300% от 200 — 309 374° эт 386% эз 174 3019 25 MP я 
5. Deduetion 080 .347** — 198 320° 203 2139 saree  .310* — 289 — 237% 2455 168 330-9 18565 — 250 > 
6. Rigidity 8 3200  —070 2000 —.006 350%% —037 207 273 421 22» ли 470 A88 Q2 1123 
7. Structuring Р 47  —008 .180 124 176 55 92 187 on 202 203 043 106 000 2028 
8. Concentration 3291** 097 196 336% .183 345 3279 328% 290% 281. п> лез 321% 117 125 2 
9. Range 212 33» — 229 205 198 42 30 303%% — 2598 130 132 ATP ne з 103 
10. Human Relat. Н .100 .108 лаз 165 061  —.000 163 155 °з їп -o0  —131 -0ю 261 .178 
11. Popular P 2 — .323** 110 283° ли 240 золе soe — 207 — 305 109 206 ae м» 205 
12. Original о -— —1600  —4025 -000 -00 —181 —151 -140 —14l  —165 -09 -07 -1м хиз 060 
13. Persistence 8 26  —076 200 —006 350° —037 207 їз a зә» ли A70 88 202 123 
14. Aggressiveness 310% 2%  .320** 400% 230% .314** 442. A38* 302% 34» 410. — .100 373* 210 200 
18. Boe. Respons. 23» — 3179 22% 320% 113 м9 320**  316* — 230) 1160 188 4904 196 r JE 
16. Cooperation 004 581 -01 -0й  —129 -456 -0 -058 407 008 -mi -9и -о0 291 —.151 
17. Tact 006 -oa эм» 028 049 025 1107 4008 223 n 405 хаз эю AM -—005 
18. Confidence m m а» — 312 .109 160 3600 — 257 197 187 m 119 274 мз 192 E 
19. Consist. Behav. зї» -037 320» 14 21% .133 mz» ne 450 м» zr оз 22% A08 зз 
30. Anxiety ECH -. -2%®  —304  —25 —071  —25» -—27» —391* —18 -00 -22) -04 фм:  —29» -п> 
31. Moodiness —3298** —154  —280** —363* —.181  —349* —375 -—390  -319** —297* —312* —197 —337* -164 —-14 
22. Act. Pot. м д sm 191 287% їз зи 235% 337 306% 2228 2579 ә 2949 27 su” 
23. Impulsiveness -415  —M1  -—339* —200  —376* —320** —330* -25  —187 -25 121 -ю» -14 -—JMT 
94. Flexibility 26» ме — 337 185 As ms 331" — 22» — 156 273 D4 E э» me 
25. Conformity 320% 107 205 083 213 п» — 299 эъ эм ш E am мз 267 
26. F- -180  —125  —421* —067  —4289* —235% —351* -—A417* -2P -35 -27» 373° -20 -7 
27. FM 294 фа її -018 027 эб 052 48 26 A23  —404 хт хо 19 
28 FC 094 064 017 4M -06 050 хоз -æ -5 -30 -оп -1% ©з 40 
30. CF £2 -—115 ou -2% -16 -15 127 2206 Л жо 208 х2 91 -o 
0. A -1231 -140  —100  —4078  —)19  —176 ләт ——177 -ой 066 103 -45 -02 023 
* .001 level of significance.
** .01 level of significance.
a .02 level of significance.


ACT Standard Scores 


TABLE 4 


Intercorrelation of SORT, WAIS Verbal and ACT (N = 84)
four terms, only one student was in a regular college curriculum.
This student had a WAIS IQ of 120.

h. With respect to mean WAIS Vocabulary raw scores, t-tests
   yielded the following results:
   Group A vs. Group B: difference significant at .01 level.
   Group B vs. Group C: difference significant at .001 level.
   When t-tests were applied to WAIS IQ’s of the three groups,
   almost exactly the same results were obtained.

Results—Part II. Two regression equations were developed with 
one using GPA as a criterion and the other using corrected GPA. 
The ACT scores, WAIS scores, SORT scores, and American Home 
Scale scores were entered into the formulas. The two final equations 
included one ACT factor—Social Studies and five SORT raw score 
factors: 


X_1: ACT, Social Studies           X_4: SORT 3, Dd, Pedantic
X_2: SORT 1, W, Theoretical        X_5: SORT 12, P, Popular
X_3: SORT 2, D, Practical          X_6: SORT 13, O, Original

All the other variables did not appear to be highly related to the
predicted GPA. The final equations were:

GPA = 2.359 + .047 (X_1) - .050 (X_2) - .044 (X_3)
      - .043 (X_4) + .056 (X_5) + .081 (X_6)

Corrected GPA = 1.896 + .067 (X_1) - .055 (X_2) - .052 (X_3)
      - .038 (X_4) + .065 (X_5) + .0885 (X_6)

Through using these equations, the predicted GPA and the predicted
corrected GPA were calculated for each of the 84 subjects.
Correlation coefficients were calculated between the predicted 
GPA’s and the actual GPA’s for the group. These r’s were: 


1. GPA Predicted vs. Actual = .47
   Standard Error of Estimate = .62

2. Corrected GPA Predicted vs. Actual = .58
   Standard Error of Estimate = .60

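
As a rough illustration of how the two equations are applied, the sketch below evaluates them for a single student. It assumes the six terms of each printed equation correspond, in order, to X_1 through X_6 as listed above; the example scores are invented and the function name `predict` is hypothetical.

```python
# Weights transcribed from the two final equations in the text.
# Order: intercept, X1 (ACT Social Studies), X2 (SORT W), X3 (SORT D),
# X4 (SORT Dd), X5 (SORT P), X6 (SORT O). The ordering is an assumption.
GPA_WEIGHTS = (2.359, 0.047, -0.050, -0.044, -0.043, 0.056, 0.081)
CORRECTED_GPA_WEIGHTS = (1.896, 0.067, -0.055, -0.052, -0.038, 0.065, 0.0885)


def predict(weights, scores):
    """Return intercept + sum(weight * score) for the six predictors."""
    if len(scores) != len(weights) - 1:
        raise ValueError("expected one score per predictor")
    intercept, *slopes = weights
    return intercept + sum(b * x for b, x in zip(slopes, scores))


# Hypothetical student; these six scores are invented for illustration.
example_scores = (20, 10, 12, 5, 8, 3)
predicted_gpa = predict(GPA_WEIGHTS, example_scores)
predicted_corrected_gpa = predict(CORRECTED_GPA_WEIGHTS, example_scores)
```

Correlating such predicted values with the actual GPA's over all subjects gives the kind of coefficient reported in the text.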
Summary and conclusions. On the basis of the sample data it ap- 
peared that a multiple regression equation may be fruitful in pre- 
dicting academic success at Wright College when SORT and ACT 
variables are substituted into the equations. However, additional
studies are needed for cross-validation purposes. The sample used 
in this study was small and the criterion employed (overall GPA) 
was probably somewhat unreliable. Lavin (1965) pointed out that 
a probable cause of low correlation between grades and predictors 
might be due to uncontrolled variations in the grades themselves. 
He indicated that the two principal sources of error variance were 
(a) that not all students take the same courses and (b) that differ- 
ent teachers use different criteria in grading students. Despite these 
limitations, the predicted corrected GPA regression equation 
yielded a greater r with actual corrected GPA than did any single 
ACT or WAIS measure. (See Table 5.) 

It appears from the data from the multiple regression equations 
that five Rorschach traits or what may be tentatively inferred to 
be five traits of personality are significantly related to overall GPA 
when the ACT Social Studies standard scores are taken into ac- 
count. 

Further, the ACT subtests, except for Social Studies, appar- 
ently were not significantly associated with overall GPA in the 
group studied. In his unpublished study of Social Science 101 stu- 
dents, Reinfranck (1962) found that students with an ACT Social 
Studies standard score of 17 or below (transposed from SCAT) 
rarely earned higher than a “D” grade in Social Science 101. 


TABLE 5
Coefficients of Correlation of WAIS and Corrected GPA and
ACT and Corrected GPA

WAIS Scaled Scores    Corrected GPA    ACT Standard Scores    Corrected GPA
Information           .334**           English                .323*
Comprehension         .278*            Math                   .344*
Arithmetic            .260a            Social Studies         .454*
Similarities          .378*            Nat. Sci.              .941**
Digit Span            .358*            Composite              .421**
Vocabulary            .412*
Total Verbal          .485*
Total IQ              .471*

Note. The r of actual corrected GPA and predicted corrected GPA using the multiple
regression equation is .579.
* .001 level of significance.
** .01 level of significance.
a .02 level of significance.
Using nonparametric tests, the writers found that Group A and
the overachieving group had a significantly lower mean in SORT 
Pedantic (Dd) and a significantly higher mean in SORT Con- 
formity (O:P) than did Group B or the underachieving group. 
Many studies, as cited by Lavin (1965), have been reported that 
used a variety of Rorschach techniques to relate Rorschach traits 
to academic achievement with primarily negative results. However, 
Erb (1961) in a test equated with SORT Conformity, found that 
females who were higher in Conformity achieved more (.01 level of 
confidence) than females low in Conformity, but he found no dif- 
ference in males high or low in Conformity with respect to their 
academic achievement. Schmeidler, Nelson and Bristol (1959) 
found “Rorschach ratings (good, fair and poor) showed significant 
or suggestive relations . . . in the anticipated direction to academic 
honors and probation.” The above studies and our findings indicate 
that Rorschach measures may be useful in identifying potentially 
good students. 

To reduce some of the chance factors influencing GPA, it is suggested
that a large group of students, perhaps 400 to 500, be given
SORT and that two regression equations be developed, one for each
course, Social Science 101 and Biology 101. The final grade in each
course would be the criterion for the equation developed from that 
course. The final grades in these two courses are primarily depart- 
mentally determined and they are probably not subject to as many 
varied influences as an individual teacher’s grades. 

The depth factors of SORT need further study since their constellation
in the equations is not amenable to breakdown into individual
study. It is hoped that a larger sample will make it possible
to deal more extensively with these personality factors.
THE PREDICTIVE VALIDITY OF A BATTERY OF
MEASURES FOR EACH OF FOUR DIFFERENT CLASSES
OF A MEDICAL SCHOOL¹


SEYMOUR POLLACK 
University of Southern California School of Medicine 
AND 
WILLIAM B. MICHAEL 
University of California, Santa Barbara 


Problem. For samples of fifty-nine freshmen, fifty-five sophomores,
forty juniors, and thirty-three seniors out of classes usually
numbering between sixty and sixty-nine students at the University 
of Southern California School of Medicine, the purpose of this in- 
vestigation was to determine the criterion-related validity of each 
of twelve measures representing scholastic aptitude, achievement 
in premedical college courses, and attitudes toward psychological 
aspects of the doctor image and the doctor-patient relationship as 
perceived by students. In addition to ascertaining the validity of 
several types of predictors it was also thought informative to de- 
termine any discernible pattern of change in the strength of the 
coefficients of validity relative to the stage of progress of students 
in the medical school program. 

Predictors and criterion variable. The twelve measures which are 
cited in Table 1 are largely self explanatory. Unfamiliar to many 
readers may be Blum’s (1957) Patient Attitude Test (PAT) and 
the Doctors’ Opinion Questionnaire (DOQ), which the writers have 
described elsewhere (Pollack and Michael, 1965). Estimated by 
Blum (1957) to have internal consistency estimates of reliability of 


¹ This research was supported by Research Grant No. MH-07366-01 from
the National Institute of Mental Health to the University of Southern Cali- 
fornia School of Medicine. 
TABLE 1
Predictive Validity Coefficients of Each of Twelve Measures for Four Different Classes of
Medical Students During the 1961-1962 Academic Year at the University of
Southern California Medical School (All decimal points omitted)

                                                  Criterion Variable of Class Rank
                                               Freshmen  Sophomores  Juniors   Seniors
Predictor Variables                            (N = 59)  (N = 55)    (N = 40)  (N = 33)
 1. Patient Attitude Test, Scale 1                09       -06          01       -16
 2. Patient Attitude Test, Scale 2                11        05          54**     -22
 3. Patient Attitude Test, Scale 3               -21        19         -06       -24
 4. Doctors' Opinion Questionnaire               -03        04          05        20
 5. MCAT, Verbal                                 -16       -08          22        08
 6. MCAT, Quantitative                           -01        01          08        00
 7. MCAT, Modern Society                         -31*       00          12        14
 8. MCAT, Science                                 14       -01          25        18
 9. Pre-Med Course Work, College GPA              31*       15          03        25
10. Pre-Med Course Work, Required
    Science GPA                                   1         14          06        00
11. Last Year of Academic Work
    (Prior to Med School) GPA                     34**     -02          03       -03
12. Median Rating of Application
    Committee                                    -03       -07          02        49**

* Significant at the .05 level.
** Significant at the .01 level.

.88, .89, .86, and .89 respectively, the three scales of the PAT and
the DOQ instrument are correspondingly intended to sample attitudes
of patients toward doctors and of doctors toward patients in
matters of medical practice. In the four samples of medical students
investigated, instructions were given to each group to respond
to the items from their own point of view.

Statistical treatment. Intercorrelation coefficients among the predictors
(which are not reported but are available from the writers) and
coefficients between each of the predictors and rank in class (the
rank values being reversed to allow emergence of positive coefficients)
were calculated for each of the four classes. In view of the
relatively low values of the validity coefficients and the smallness
of the samples, it was decided that multiple-regression analyses
would not be feasible.

Results and discussion. Inspection of Table 1 reveals that (a) only
two and three coefficients were statistically significant at the .05 and
.01 levels, respectively, and (b) no consistent or reliable patterns of
predictive validities could be identified with respect to the cross-sectional
samples investigated. Although the presence of a marked
restriction of range in each class (as evidenced by relatively small
standard deviations and relatively high means on the Medical Col- 
lege Aptitude Test scales) and of a bias of experimental mortality 
particularly noticeable for the junior and senior classes could be 
offered as a partial explanation for the small magnitudes of the co- 
efficients obtained, a relatively more plausible hypothesis may well 
be that the student characteristics necessary for success in the medical
school were not tapped to any substantial degree by any of the
predictors employed for these samples. It would also seem that dif- 
ferentiation of the single criterion measure into a number of reliably 
determined components of academic and clinically-oriented activi- 
ties might lead to somewhat larger validity coefficients than those 
realized. Furthermore, crossvalidation efforts with new samples of 
medical students might possibly reveal some degree of gain in the 
validities of certain of the predictors, although the promise of such 
improvements is not to be anticipated without efforts to obtain op- 
erationally stated and observable objectives of the multidimen- 
sional criterion of success—objectives that can be reliably measured 
as students progress in their four-year program. 


REFERENCES 
PREDICTION OF FRESHMAN ACHIEVEMENT WITH A
MODERATOR MODEL 


JOHN E. BOWERS 
Haile Sellassie I University¹


Purpose. This study compares the predictive validity of a multiple
regression model, predicting first term grade point average from
a weighted linear combination of high school percentile rank and
ACT Composite score, with the validity of Saunders' (1956) “moderated”
multiple regression model, predicting the same criterion
from these two predictors plus a weighted moderator variable,
which is the product of high school percentile rank and ACT Composite
score. Should the moderated model predict first term grades
significantly better than the customary two-variable model, one
concludes that high school percentile rank and ACT Composite
score interact in the prediction of first term grade point average. In other
words, the degree of relationship between first term grade point
average and high school percentile rank changes at different ACT
Composite score levels.
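
The comparison of the two models can be sketched in a few lines. This is only an illustration under stated assumptions: the nine data points are invented (built with a known interaction weight of 0.05), the helper names `solve`, `fit`, and `r_squared` are hypothetical, and ordinary least squares through the normal equations stands in for whatever program was actually used.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def fit(rows, y):
    """Least-squares weights (intercept first) via the normal equations."""
    design = [[1.0] + list(r) for r in rows]
    k = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(k)] for i in range(k)]
    xty = [sum(d[i] * yi for d, yi in zip(design, y)) for i in range(k)]
    return solve(xtx, xty)


def r_squared(rows, y, weights):
    """Squared multiple correlation between y and the weighted composite."""
    predicted = [weights[0] + sum(b * x for b, x in zip(weights[1:], r)) for r in rows]
    mean_y = sum(y) / len(y)
    ss_res = sum((yi - p) ** 2 for yi, p in zip(y, predicted))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot


# Hypothetical (percentile rank, ACT Composite) pairs and grade averages.
pairs = [(h, a) for h in (0.2, 0.5, 0.8) for a in (16.0, 22.0, 28.0)]
gpa = [0.5 + 1.0 * h + 0.03 * a + 0.05 * h * a for h, a in pairs]

two_variable = [(h, a) for h, a in pairs]
moderated = [(h, a, h * a) for h, a in pairs]  # adds the product term

w_two = fit(two_variable, gpa)
w_mod = fit(moderated, gpa)
r2_two = r_squared(two_variable, gpa, w_two)
r2_mod = r_squared(moderated, gpa, w_mod)
```

With these invented data the moderated model recovers the built-in product weight and fits better than the two-variable model; real data would of course include error.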

Predictors. Three predictors were used: (1) high school percentile
rank, defined as (N - R + 1/2) divided by N, where N equals the
number of graduates ranked in each subject’s class, (2) the Composite
score on the American College Test battery, and (3) the
product of high school percentile rank and the ACT Composite
score.
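
A minimal sketch of the first and third predictors follows. It assumes that R is the subject's rank in class with 1 as the highest standing, which the text does not state; the function names are hypothetical.

```python
def high_school_percentile_rank(rank, class_size):
    """(N - R + 1/2) / N, assuming R is rank in class with 1 = top."""
    if not 1 <= rank <= class_size:
        raise ValueError("rank must lie between 1 and class_size")
    return (class_size - rank + 0.5) / class_size


def moderator(hspr, act_composite):
    """Moderator variable: product of percentile rank and ACT Composite."""
    return hspr * act_composite


# Hypothetical subject ranked first in a class of 200 with ACT Composite 24.
top_rank = high_school_percentile_rank(1, 200)
bottom_rank = high_school_percentile_rank(200, 200)
product_term = moderator(top_rank, 24)
```

The half-point keeps the percentile rank strictly between 0 and 1 for every rank in the class.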

Criterion. Overall University grade point average in graded
courses taken during the first semester in attendance on the Champaign-Urbana
campus of the University of Illinois was used as the
criterion.


¹ Data reported herein collected while author was at the University of
Illinois.
TABLE 1
Multiple Correlations of First Semester Grade Point Average
with Both Prediction Equations

Prediction Equation        Multiple Correlations with GPA I for:
Based Upon                 1962 Freshmen    1963 Freshmen
HSPR, ACT                  .5904            .5504
HSPR, ACT, Moderator       .5954            .5586


Subjects. Data were collected from the records of 4056 Semester
I, 1962-1963 and 4283 Semester I, 1963-1964 beginning freshmen
admitted to the Champaign-Urbana campus of the University of
Illinois.

Results. Multiple correlations between the first semester grade 
point average and the prediction equations associated with the two- 
variable and the moderated models are shown for the two freshman 
classes in Table 1. Table 2 summarizes F-tests of the hypothesis 
that the regression weight for the moderator variable is zero. The 
moderator model predicted first semester grade point average sig- 
nificantly better than did the model weighting high school percentile 
rank and ACT Composite score. This is obviously a function of the 
population sizes. Differences in the correlation of the two predic- 
tion equations were slight. 
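
The size of such an F ratio can be approximated from the two multiple correlations alone, using the usual formula for the increment in squared multiple correlation. The sketch below applies it to the 1962 figures in Table 1; the function name is hypothetical, and because the tabled correlations are rounded the result should be close to, but not identical with, the reported ratio.

```python
def incremental_f(r_full, r_reduced, n, k_full, k_added=1):
    """F ratio for the gain in R-squared when k_added predictors are added.

    r_full, r_reduced: multiple correlations of the larger and smaller model.
    n: number of subjects; k_full: number of predictors in the larger model.
    """
    r2_full = r_full ** 2
    r2_reduced = r_reduced ** 2
    df_residual = n - k_full - 1
    f_ratio = ((r2_full - r2_reduced) / k_added) / ((1.0 - r2_full) / df_residual)
    return f_ratio, k_added, df_residual


# 1962 freshmen: two-variable R = .5904, moderated R = .5954, N = 4056.
f_1962, df_num, df_den = incremental_f(0.5954, 0.5904, n=4056, k_full=3)
```

A gain of about .006 in proportion of variance is significant here only because the residual term is divided by several thousand degrees of freedom.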


TABLE 2 
Analyses of Variance Tests of the Significance of the Moderator Variable 


                             1962 Freshmen                   1963 Freshmen
                        Proportionate                   Proportionate
Source of               Sum of               F          Sum of               F
Variation               Squares      df      Ratio      Squares      df      Ratio
Differences between:
Moderator Model and
Two-Variable Model      .0059        1       36.95      [illegible]  [illegible]  6.68
Residual Variation      .6455        4052               [illegible]  [illegible]

REFERENCE 


Saunders, D. R. Moderator Variables in Prediction. EDUCATIONAL
AND PSYCHOLOGICAL MEASUREMENT, 1956, 16, 209-222.
A TEST OF VARIATION IN GRADING STANDARDS


JOHN E. BOWERS 
Haile Sellassie I University¹


Purpose. This investigation analyzes the relationships between 
first term grades and academic ability for two classes of beginning 
freshmen, admitted to the Champaign-Urbana campus of the Uni- 
versity of Illinois, in order to determine whether both groups were 
graded on comparable scales of evaluation. 

It is assumed that a change in the grading standards for the two 
groups is reflected in differences in their regression equations pre- 
dicting first semester grade point average from a weighted combina- 
tion of high school percentile rank and ACT Composite score. If 
the regression equations differ in slope, a test of intercept differences 
is not particularly meaningful; if the equations are parallel, a 
difference in intercepts implies variable grading standards, under
the assumption that variation reflecting other effects (e.g., high
school courses taken, high school grading standards, equivalence of
test scales, college courses taken) is negligible.
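
One common way to carry out the two tests is to compare residual sums of squares from three nested fits: separate equations per group, parallel equations with separate intercepts, and a single common equation. The sketch below follows that approach as an assumption, since the text does not give its computing formulas; the data are invented and the function names are hypothetical.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def sse(rows, y):
    """Residual sum of squares of the least-squares fit (intercept added)."""
    design = [[1.0] + list(r) for r in rows]
    k = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(k)] for i in range(k)]
    xty = [sum(d[i] * yi for d, yi in zip(design, y)) for i in range(k)]
    w = solve(xtx, xty)
    return sum((yi - sum(wi * di for wi, di in zip(w, d))) ** 2
               for d, yi in zip(design, y))


def grading_standard_tests(x1, y1, x2, y2):
    """Return (F for slope differences, F for intercept differences)."""
    p = len(x1[0])                      # number of predictors
    n = len(y1) + len(y2)
    y = list(y1) + list(y2)
    sse_separate = sse(x1, y1) + sse(x2, y2)
    with_dummy = [list(r) + [0.0] for r in x1] + [list(r) + [1.0] for r in x2]
    sse_parallel = sse(with_dummy, y)   # common slopes, separate intercepts
    sse_common = sse(list(x1) + list(x2), y)
    f_slopes = ((sse_parallel - sse_separate) / p) / (sse_separate / (n - 2 * p - 2))
    f_intercepts = (sse_common - sse_parallel) / (sse_parallel / (n - p - 2))
    return f_slopes, f_intercepts


# Hypothetical groups: identical slopes, second intercept higher by 0.5.
grid = [(h, a) for h in (0.2, 0.4, 0.6, 0.8) for a in (16.0, 20.0, 24.0, 28.0)]
noise_1 = [0.05 * ((2 * i) % 5 - 2) for i in range(16)]
noise_2 = [0.05 * ((3 * i) % 5 - 2) for i in range(16)]
gpa_1963 = [0.5 + 2.0 * h + 0.05 * a + e for (h, a), e in zip(grid, noise_1)]
gpa_1964 = [1.0 + 2.0 * h + 0.05 * a + e for (h, a), e in zip(grid, noise_2)]
f_slopes, f_intercepts = grading_standard_tests(grid, gpa_1963, grid, gpa_1964)
```

With two predictors the slope test has 2 degrees of freedom in the numerator and the intercept test has 1. The intercept test is read only when the slope test gives no evidence against parallel equations.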

Subjects. Data were obtained from the records of 4283 beginning 
freshmen first-time enrolled for the fall 1963 semester, and 5132 be- 
ginning freshmen first-time enrolled for the fall 1964 semester. 

Predictors. Two predictors were used: (1) high school percentile 
rank, calculated in the customary manner, and (2) composite score 
on the American College Test battery. 

Criterion. The grade point average achieved in all graded courses 
completed during the first semester in attendance was defined as the 
index of grading. 

Results. Table 1 summarizes analysis of variance F-tests of com- 

¹ Data reported herein collected while author was at the University of
Illinois.
TABLE 1
F-Tests of the Hypothesis of Common Regressions 
and Intercepts 


F-Ratio Testing for 
Differences in: 
Regressions Intercepts 
College (df = 2) (df = 1) 
Agriculture 0.11 0.14 
Commerce 0.11 17.64** 
Education 3.48* —
Liberal Arts (males) 2.13 0.49 
Liberal Arts (females) 10.11** — 
Fine Arts (males) 0.82 0.45 
Fine Arts (females) 0.27 0.03 
Physical Education 3.75* — 
Institute of Aviation 3.07 0.33 
Engineering 0.01 1.33 
* Significant at the .05 level. 
** Significant at the .01 level. 


mon regression (slopes) and intercepts for the linear regression 
equations predicting first semester grade point average from high 
school percentile rank and ACT Composite score for the fall 1963
and the fall 1964 freshman groups. (Previous analyses showed sex 
differences in the prediction of first semester grade point average 
for beginning freshmen admitted to the College of Liberal Arts and 
Sciences and to the College of Fine and Applied Arts.) Significant 
differences at the .05 level were found between the slopes of the 
equations for the freshmen in the College of Education, the College 
of Physical Education, and for freshmen women in the College of 
Liberal Arts and Sciences. For these three groups, no intercept differences
were tested. A significant difference was found in the intercepts
of the equations for fall 1963 and fall 1964 freshmen in
the College of Commerce and Business Administration; therefore, it
might be concluded that grading standards changed in this college
from the 1963 to the 1964 year.
OTIS IQ: FORECASTING EFFICIENCY FOR
UNDERGRADUATE EDUCATION COURSES 


L. L. AINSWORTH AND A. M. FOX 
Sam Houston State College 


Problem. The purpose of this study was to determine the extent
to which Otis Quick-Scoring Mental Abilities Test (OQS) IQ’s may
be useful for the prediction of grade-point ratios (GPR’s) in undergraduate
Education courses.

Sample. The sample consisted of 4,799 students at Sam Houston
State College for whom OQS IQ's and at least one grade in an undergraduate
Education course were available, through August, 1964.

Procedure. GPR’s, mean GPR, mean IQ, the coefficient of correlation
(r), regression equation, coefficient of alienation (k), index of
forecasting efficiency (E), and the standard error of estimate (σ_yx)
were computed by an IBM 1620 Data Processing System.

Results. Mean IQ for the 4,799 students was 111, with an SD of
9.79. Mean GPR (4-point system) was 2.0, with SD 0.90, r equal to
.43, and the regression equation was predicted GPR = 0.04 (IQ) -
2.17, with σ_yx 0.82. E was found to be 9.7 per cent, and k to be
0.90.
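
The coefficient of alienation and the index of forecasting efficiency follow directly from r by their standard definitions, k = sqrt(1 - r^2) and E = 100(1 - k). The short sketch below reproduces them from the reported r of .43 and SD of 0.90; the function names are hypothetical.

```python
import math


def coefficient_of_alienation(r):
    """k = sqrt(1 - r^2)."""
    return math.sqrt(1.0 - r * r)


def forecasting_efficiency(r):
    """E = 100 * (1 - k), the per cent reduction in error of estimate."""
    return 100.0 * (1.0 - coefficient_of_alienation(r))


def standard_error_of_estimate(sd_criterion, r):
    """Criterion SD multiplied by k."""
    return sd_criterion * coefficient_of_alienation(r)


k = coefficient_of_alienation(0.43)
e = forecasting_efficiency(0.43)
see = standard_error_of_estimate(0.90, 0.43)
```

The computed k and E agree with the reported values after rounding; the standard error of estimate comes out near the reported 0.82, the small difference presumably reflecting rounding of the inputs.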
THE VALIDITY FOR NINTH GRADE ACHIEVEMENT
OF THE SSAT AND OTHER ADMISSION CRITERIA 
AT A PRIVATE SECONDARY SCHOOL 


JAMES M. SCHUERGER 
Gilmour Academy 
Gates Mills, Ohio 
AND 
HENRY F. DIZNEY 
Kent State University 


Purpose. The primary purpose of this study was to consider the
validity of the Secondary School Admissions Test (SSAT) (Educational
Testing Service, 1963) for the prediction of ninth-grade
GPA at a private boys' school in northeastern Ohio. In addition,
Otis IQ scores, previous school grades, and the scores on an untimed
admission composition test were used separately or in combination
to estimate relationships to specific course grades and grades in
general.

The composition grades, from A+ to F, were quantified on a 13-
point scale, 12 high and 0 low. In order to obtain an estimate of the
reliability of these grades, each composition was rescored by ran-
dom redistribution to the same group of raters, six in all. Names of
the applicants and previous ratings were removed. The two ratings
obtained in this manner, based on five written criteria, were com-
bined (range 0-24) for use in further computation, and the separate
ratings were used to compute a Pearson product-moment correla-
tion coefficient as an estimate of rater-reliability. The obtained r of
.46 was lower than expected but not unusually low for two raters,
particularly for ratings from a selected population. For example,
Godshalk, Swineford, and Coffman (1966) found reading reliabili-
ties of .36 to .41 for single topics read once.

English, mathematics, and total grades from the eighth grade
prior to admission were used as predictors. English, algebra, Latin, 
and total grade point average (GPA) from the ninth grade after 
admission were used as criteria. The total GPA from the ninth 
grade was later used as the criterion for multiple predictions. All 
grades were given numerical values from 0 to 12 representing letter 
grades from F to A+ respectively. 

Prior to 1965, SSAT scores were reported as vocabulary, reading 
speed, reading level, total reading, verbal, quantitative, and total 
score. These scores were in standard-score form with an approxi-
mate range of 240 to 360 (Educational Testing Service, 1963). In
order to simplify interpretation, three of these scores, total reading, 
verbal, and quantitative were utilized as predictors in this study. 

Summary statistics for each of the 12 variables included are 
given at the foot of Table 1. 

Sample. The total number of cases in the study was 144, taken 
over a period of three academic years, 1962-63 to 1964-65. Subjects 
were eighth-grade boys applying for admission. They were mostly 
from higher socioeconomic classes, a great proportion of their fath- 
ers being professional or managerial. As will be noticed from Table 
1, they scored high on measures of academic ability; the average 
Otis IQ was 120, and the average verbal score on SSAT was 284. 

Procedure and results. Intercorrelations among all twelve varia- 
bles were computed and tested for significance (r > .165 significant 
at .05 level; r > .215 significant at .01 level). Table 1 gives these 
data. In addition, multiple correlations were generated, using ninth 
grade GPA as the criterion, with every possible combination of 
predictors. 
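
The significance thresholds quoted for r can be checked approximately with the usual t transformation of a correlation coefficient. The sketch below assumes t = r sqrt(N - 2) / sqrt(1 - r^2) with N - 2 degrees of freedom and N = 144, the sample size stated above. The function name is illustrative, and the critical values mentioned in the comment are approximate figures from standard t tables, not values given in the paper.

```python
import math

def t_from_r(r: float, n: int) -> float:
    """t statistic for testing r against zero, with n - 2 degrees of freedom."""
    return r * math.sqrt(n - 2) / math.sqrt(1.0 - r * r)

N = 144  # number of cases reported in the study
for r in (0.165, 0.215):
    # For df = 142 the two-tailed critical t is roughly 1.98 (.05) and 2.61 (.01).
    print(f"r = {r:.3f} -> t = {t_from_r(r, N):.2f}")
```

Both thresholds give t values just above the corresponding approximate critical values, which agrees with the cutoffs the authors report.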

The best single predictors of ninth-grade GPA were eighth-grade 
GPA, Otis IQ, and the quantitative score on SSAT. A combination 
of verbal and quantitative scores on SSAT, a grouping logically 
comparable to Otis IQ, yielded a multiple correlation with the same 
criterion of .60. The composition score correlated more highly with
ninth-grade GPA than did two of the SSAT sub-scores but less
highly than SSAT-Q, Otis IQ, and each of the eighth-grade variables.

In a similar study using data from SSAT, Pitcher (1962) re- 
ported correlations of from .41 to .68 between SSAT total and 
ninth-grade GPA at two independent schools. In the same study, 
previous school record correlated with ninth-grade average from .29
to .45. Multiple correlations using both predictors ranged from .46
to .70.

The best single test score as predictor of success in ninth-grade
Latin was SSAT-Q (r = .39); the correlation obtained was lower
than that from either eighth-grade mathematics or eighth-grade
GPA (each r = .48), each being a typically available measure of
prior performance.

While the best predictor of ninth-grade English was eighth-grade
English (r = .54), with the admission composition measure slightly
lower (r = .50), the best predictors of ninth-grade algebra were not
previous grades but test scores, specifically Otis IQ (r = .46) and
SSAT-Q (r = .45); the high intercorrelation (r = .68) between
Otis IQ and SSAT-Q renders it unlikely that much could be gained
by combining the two as predictors of success in mathematics. While
there was relatively little difference in the efficacy of SSAT-V and
Q as predictors of ninth-grade English (r = .47 and r = .44, respec-
tively), Q was a significantly better predictor of ninth-grade alge-
bra than was V.
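
The remark about combining Otis IQ and SSAT-Q can be illustrated with the standard two-predictor formula for a multiple correlation, R^2 = (r1^2 + r2^2 - 2 r1 r2 r12) / (1 - r12^2). This is a sketch, not the authors' computation: the formula is the textbook one, the function name is hypothetical, and the inputs are the three correlations quoted in the paragraph above (.46, .45, and .68).

```python
import math

def multiple_r_two_predictors(r1: float, r2: float, r12: float) -> float:
    """Multiple R of a criterion on two predictors.

    r1, r2: correlations of each predictor with the criterion.
    r12: correlation between the two predictors.
    """
    r_squared = (r1 * r1 + r2 * r2 - 2.0 * r1 * r2 * r12) / (1.0 - r12 * r12)
    return math.sqrt(r_squared)

# Correlations with ninth-grade algebra quoted in the text.
R = multiple_r_two_predictors(r1=0.46, r2=0.45, r12=0.68)
print(f"R = {R:.2f}")
```

Under these assumptions R comes out near .50, only slightly above the .46 obtained from Otis IQ alone, which is consistent with the authors' remark that little would be gained by combining the two.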

In Table 2, thirteen of the multiple correlations have been pre- 
sented, one in each column; all variables for which a beta coefficient 
is listed are included in the multiple with the criterion. The correla- 
tion coefficient is presented at the bottom of the column. Many 
other combinations yielded multiple correlation coefficients over .60, 
but those listed were selected because they were considered to be 
logical combinations. For example, since variables six and seven 
were logically redundant with eight, they were deleted from all the 
listed combinations except the first. Also, two or more combinations 
without each of the four distinct classes of variables (composition, 
SSAT, Otis IQ, and grades) were used. For example, equations 9, 
10 and 11 do not include the composition score. 

It may be noticed that there is relatively little difference in the
value of R between the more complete and simpler sets of predic-
tors. Using only Otis IQ and eighth-grade GPA, a coefficient of .64
was obtained; adding the composition grade increased the R to .65.
Substituting V and Q from SSAT for the Otis IQ produced a coeffi-
cient of .66, while composition, reading total, SSAT-Q, and eighth-
grade GPA yielded the same result. In the two groupings without
grades (equations 12 and 13), composition and Otis IQ correlated
.60 with the ninth-grade criterion; composition and two SSAT
scores, reading total and SSAT-Q, correlated slightly higher with
this criterion (R = .64).

Summary. Several useful multiple regression equations were de- 
veloped as an aid to prediction in admissions work. Traditional in- 
dices such as IQ and previous record were justified as predictors, 
although slight gains were observed using the more extensive SSAT 
either as an addition or as an alternative measure. 

Mathematics grades and SSAT-Q were found to be more highly 
correlated with both Latin grades and GPA than were various 
verbal scores. The composition grade was less reliable than had 
been hoped, although it correlated more highly with GPA than any 
other single verbal score did and added slight gains to the multiple 
correlations.

PREDICTING HIGH SCHOOL BIOLOGY ACHIEVEMENT
WITH THE DIFFERENTIAL APTITUDE TESTS 
AND THE DAVIS READING TEST 


GEORGE P. HOLLENBECK 
The Psychological Corporation 


This study presents validity data for the Differential Aptitude
Tests (DAT) and the Davis Reading Test in predicting biology
achievement of high school students. Also presented is a comparison
of scores from alternate forms of the Davis Reading Test with a
nine-month interval between administrations. The students were
enrolled in classes, each of which used one of the
three versions of the Biological Sciences Curriculum Study (BSCS)
curriculum (Blue, Green, and Yellow) during the school year 1964-
1965.

Samples. All students were in the tenth grade. The three samples 
were made up of classes of students throughout the United States, 
but they were not selected systematically to make up a stratified
sample. Each sample consisted of classes using one curriculum: 
Blue, N = 132; Green, N = 288; Yellow, N = 107. Only students 
for whom complete data (both pre- and post-) were available were 
used. Since some students can be expected to have dropped out dur- 
ing the year, the sample may be slightly restricted. 

Predictor variables. The DAT, Form L, and the Davis Reading 
Test, Form 2D, were administered during the first month of the 
school year. Three parts of the DAT were given: Verbal Reasoning 
(VR), Numerical Ability (NA), Abstract Reasoning (AR). The 
sum of VR and NA (VR + NA) was used as a fourth DAT scale.

Criterion measures. The BSCS Comprehensive Final Examina-
tion in First Year Biology, Form J, served as the measure of biology
achievement. This test, described in detail in the test manual
(BSCS, 1966), is a 45-minute final achievement test designed by
the BSCS. Although each of the three BSCS curriculum versions ap-
proaches first-year biology somewhat differently, the Comprehen-
sive Final is designed for use with all three curricula. All students
took the examination during the last month of the school year.

Form 2A of the Davis Reading Test was given during the last
school month also, and the Form 2D-2A comparison provides the
data for alternate form correlations.

Descriptive statistics and validity data. The data are presented
separately for each curriculum group, since the DAT means in-
dicated that the average ability of students in the three curricula
differed. Means, standard deviations, and validity coefficients
are presented in Table 1 for each of the curriculum groups. The pre-
dictor-criterion intercorrelations are presented in Table 2. All cor-
relations are Pearson product-moment rs.
DAT validity. Of the DAT scales, the combined Verbal Reason-
ing and Numerical Ability (VR + NA) scale was the best pre-
dictor. The average correlation for the three groups is .63. This is
similar to the average DAT (VR + NA) correlation with the Com-
prehensive Final reported in the DAT manual (Bennett, et al., 1966).
The Verbal Reasoning test alone was the next best predictor (aver-
age r = .61), followed by the Numerical Ability test (average r =
.51). The Abstract Reasoning test was consistently less predictive
than VR or NA or the combination, (VR + NA).

TABLE 1
Means, SD's, and Correlations with Achievement

                                           Curriculum Group
                         Blue (N = 132)      Green (N = 288)     Yellow (N = 107)
                          M     SD    r       M     SD    r       M     SD    r
Verbal Reasoning         38.8   6.0  .60     29.5   9.6  .64     30.4   8.9  .50
Numerical Ability        31.0   5.2  .41     24.8   7.3  .55     25.5   6.7  .49
VR + NA                  69.8   9.8  .60     54.2  15.5  .65     55.8  13.8  .61
Abstract Reasoning       40.8   4.5  .38     36.4   7.5  .47     35.1   8.1  .39
Davis Reading
  Form 2D (Pre)          60.5  10.8  .51     47.9  16.1  .59     51.1  15.3  .54
  Form 2A (Post)         62.1  11.5  --      50.8  15.8  --      55.4  14.2  --
Comprehensive Final      34.9   6.0  --      23.8   6.7  --      28.6   6.9  --

Davis Reading Test validity. The validities for the Davis Read- 
ing Test for the three groups were .51, .59, and .54, (average r = 
.56), somewhat less than the DAT (VR + NA) correlations. The 
high correlation between the Davis and the (VR + NA), Table 2, 
indicated that little gain in prediction would result from a combina- 
tion of them. The multiple R’s resulting from combining the DAT 
(VR + NA) and the Davis scores for the three curriculum groups
were .62, .67, and .64 for the Blue, Green, and Yellow versions, re- 
spectively. 

Davis alternate form comparisons. Each of the curriculum groups
had higher means on Form 2A at the end of the year than on Form
2D at the beginning. The average difference in raw scores was 2.8,
ranging from 1.5 for the Blue group, to 4.3 for the Yellow group.
The correlations between the two forms were .75, .82, and .72 for the
Blue, Green, and Yellow groups. These correlation coefficients com-
pare favorably with the two-week interform reliabilities reported in
the Davis manual (Davis and Davis, 1962), even though the in-
tertest interval in this study was much longer than those reported in
the manual.

TABLE 2
Intercorrelation of Tests for Each Group

                                   Predictors                      Criteria
                             DAT                Davis 2D     Davis 2A   Comprehensive
                    2-NA   3-VR+NA   4-AR        (Pre-)       (Post-)       Final
1-VR      Blue      .45      .88     .30          .87          .60          .60
          Green     .60      .94     .60          .76          .74          .64
          Yellow    .56      .92     .46          [?]          [?]          .50
2-NA      Blue               .83     .45          .41          .35          .41
          Green              .89     .58          [?]          [?]          .55
          Yellow             .83     .42          .33          .51          .49
3-VR+NA   Blue                       .47          .55          .57          .60
          Green                      [?]          .76          [?]          .65
          Yellow                     .48          [?]          [?]          .61
4-AR      Blue                                    .29          .18          .38
          Green                                   .57          .56          .47
          Yellow                                  .48          .33          .30
Davis 2D  Blue                                                 .75          .51
(Pre-)    Green                                                .82          .59
          Yellow                                               .72          .54

Note.--Entries shown as [?] are illegible in the source.

Conclusions. Both the DAT and the Davis Reading Test had
substantial validities for predicting first-year high school biology
achievement. The DAT (VR + NA) had the highest validity of the
individual DAT scales. As indicated by the high DAT-Davis corre-
lations, little gain in validity resulted from combining the DAT
(VR + NA) and Davis scales. Interform correlations for the Davis
with a nine-month interval were high.

CRITICAL THINKING, INTELLIGENCE, AND
VOCABULARY


BERT W. WESTBROOK AND JAMES B. SELLERS
North Carolina State University at Raleigh 


Problem. The purpose of this study was to determine the rela-
tionships between critical thinking, intelligence, and vocabulary of
a group of senior high school pupils in the Raleigh Public Schools.

Method. The sample consisted of 411 pupils enrolled in one senior
high school, grades 10-12. In September of 1965, all pupils in the
sample were administered the Watson-Glaser Critical Thinking
Appraisal, Form Ym (WG), and the Quick Word Test (QWT).
The Henmon-Nelson Tests of Mental Ability, Revised Edition
(HN) had been administered at an earlier date, and the scores were
taken from cumulative folders. The WG subtest raw scores (In-
ference, Recognition of Assumptions, Deduction, Interpretation,
and Evaluation of Arguments), HN IQs, WG total raw scores, and
QWT raw scores were intercorrelated. The resulting correlation
matrix was factor analyzed by the centroid method and four factors
were rotated using the Varimax criterion for simple structure.

Findings. The intercorrelations, means, and standard deviations
are shown in Table 1. All correlations are significant at the .05 level.
The WG subtest intercorrelations are .35 or less, with the exception
of the correlation between WG Deduction and WG Interpretation
(.52). The correlations between the WG subtests and WG total raw
scores ranged from .52 for WG Interpretation to .77 for WG Deduc-
tion. The highest total test correlation was the .57 found between
QWT and HN. The correlation between WG and QWT was .44. The
correlation of .53 between WG and HN is lower than the range of
correlations (.55 to .75, with a median of .68) reported by Watson
and Glaser (1964).

TABLE 1
Intercorrelations, Means, and Standard Deviations (N = 411)

Variable                             1      2      3      4      5      6      7       8
1. WG Inference
2. WG Recognition of Assumptions    .27
3. WG Deduction                     .34    .34
4. WG Interpretation                .35    .27    .52
5. WG Evaluation of Arguments       .26    .12    .29    .30
6. WG Total Score                   .65    .64    .74    [?]    .52
7. Quick Word Test                  .30    .26    [?]    .37    .17    .44
8. Henmon-Nelson                    .35    .25    .42    .43    .30    .53    .57
Means                             11.88  11.67  19.17  18.76  10.59  72.10  49.01  124.62
Standard Deviations                2.74   3.25   3.21   2.93   2.16   9.62  14.30   15.10

Note.--Entries shown as [?] are illegible in the source.


The mean WG total raw score was 72.10 with a standard devia-
tion of 9.62. Mean total scores for the normative samples range
from 57.7 for grade nine to 74.4 for college seniors (Watson and
Glaser, 1964). The mean IQ on the HN was 124.62 with a standard 
deviation of 15.10. The mean score of 49.01 on the QWT is higher 
than the mean score (47.4) found among college freshmen (Bor- 
gatta and Corsini, 1964). 

Centroid factor analysis. The centroid factor analysis and the
rotated factor matrix are shown in Table 2. Using the procedure
outlined in Harman (1960) for approximating the standard error
of the factor loadings, the writers determined that loadings greater
than .15 are significant at the .01 level. Therefore, only loadings .15
or larger are reported in Table 2.

Four centroid factors accounted for 76, 10, 6, and 6 per cent of the
total variance. All variables have high significant loadings on factor
I. All but one of the loadings are .50 or higher and one of the loadings
is as high as .71.

Two tests, QWT and HN, have loadings of .37 and .33 on factor
II. Two of the WG subtests have significant negative loadings on
this factor.

Factor III has significant moderate loadings on all variables ex-
cept WG Inference. The QWT, HN, and WG Evaluation of Argu-
ments load negatively on factor III.

Four of the seven variables have significant loadings on factor
IV. The two highest loadings are WG Recognition of Assumptions
and WG Evaluation of Arguments, which have negative loadings.

Varimax rotation. The four centroid factors were rotated
using the Varimax criterion, and the results are shown in Table 2.
Factor I has significant loadings on all variables except variable 5,
WG Evaluation of Arguments. Three subtests of the WG have load-
ings from .50 to .65 on factor I.

Factor II has two tests with high loadings on it; the QWT and
the HN have loadings of .68 and .69, respectively. All variables
except WG Recognition of Assumptions have significant loadings
on factor II.

Only one variable, WG Inference, loads significantly on factor III
and this loading is a moderate .24.

The two tests with the highest factor loadings on factor IV are
two WG subtests; WG Recognition of Assumptions has a loading
of −.53, and WG Evaluation of Arguments has a loading of −.52.
With the exception of HN, the remaining variables show significant
moderate loadings on factor IV.

Discussion. A considerable amount of empirical data shows that
a substantial relationship exists between critical thinking ability as
measured by the WG and mental ability measured by conventional
intelligence tests. Watson and Glaser (1964) report correlations
between the WG and intelligence measures ranging from .55 to .75
with the median at .68. The present study found a correlation of
.53 between the WG and the HN. Although this correlation is lower
than that found with many measures of mental ability, the result is
consistent with the correlations found between the WG and highly
verbal measures of mental ability such as the Miller Analogies
(.55) and the Wechsler Adult Intelligence Scale, Verbal Scale (.55).
The results of this study suggest that the WG may measure critical
thinking abilities that are not being tapped by highly verbal intel-
ligence tests.

TABLE 2
Factor Structure of Seven Variables

                                                      Factors
                                         Centroid                     Varimax
Variables                           I     II    III    IV        I     II    III    IV
1. WG Inference                    .57  −.16                    .50   .16   .24   .22
2. WG Recognition of Assumptions   .50  −.20   .16  −.30        .36               −.58
3. WG Deduction                    [?]   [?]   [?]   [?]        .65   .34         −.17
4. WG Interpretation               .69         .24              .63   .33         −.21
5. WG Evaluation of Arguments      .45   [?]   [?]   [?]        [?]   [?]   [?]   [?]
6. Quick Word Test                 .58   .37  −.25   [?]        [?]   [?]   [?]   [?]
7. Henmon-Nelson                   .66   .34   [?]   [?]        [?]   .69   [?]   [?]

Note.--Only loadings of .15 or larger are reported. Entries shown as [?] are illegible in the source.

The factor analysis of the variables employed in this study re- 
vealed the presence of a strong general factor on the WG, although 
the loadings were reduced when the Varimax rotation was con- 
ducted. One factor was defined by the QWT, a vocabulary test, and 
the HN, a highly verbal intelligence test; both tests measure knowl- 
edge of word meanings and verbal discrimination to a greater ex- 
tent than the WG variables. Since three of the WG variables loaded 
primarily on one factor and two WG variables loaded on another, 
critical thinking abilities of higher ability students might form a 
structure different from that found in normal populations. 

CONCURRENT VALIDITY OF THE PORTEUS MAZE
TEST: A COMPARATIVE STUDY OF REGULAR AND
EDUCATIONALLY HANDICAPPED
HIGH SCHOOL STUDENTS¹


HARVEY A. TILKER AND ROBERT E. SCHELL 
Michigan State University 


In a previous study it was possible to predict effectively who 
would or would not be considered employable among a group of 
educationally handicapped students participating in a special work- 
training program (Gambaro and Schell, 1966). It is perhaps not 
surprising that a number of school personnel, at the time of that 
study, considered the students participating in the program to be 
mentally retarded. Such a diagnosis seemed warranted insofar as 
these students had performed relatively poorly on an individual in- 
telligence test, had a history of course work failure for the last two 
or more years, and showed retardation of two or more years on 
scholastic achievement tests. Insofar as the results of that study are 
concerned, however, such a diagnosis would not seem warranted. In 
particular, the performance of the students on the Porteus Maze 
test was inconsistent with such a diagnosis. They scored too high. 

The study reported here attempted to clarify this diagnostic is- 
sue. In doing so, it provides data relevant to concurrent validity of 
the Porteus Maze test. 

Method. The Porteus Maze Test (PMT) was administered to
154 students in regular school programs, and additional descriptive
and test information was obtained on them as well as on the 71 ed-
ucationally handicapped students previously studied. The regular
students were randomly selected from rosters of full time students
in the regular school program who were of the same age and grade
level distribution as the educationally handicapped students. Both
groups of Ss attended the same three high schools, ranged in age
from 16 to 19 years, and all were juniors or seniors.

1 Supported in part by Grant No. 4143 from the All University Research
Fund, Michigan State University. We wish to thank Mr. Robert Chamber-
lain, the school principals, and the teachers and students of the Lansing Pub-
lic High Schools, whose kind assistance made this study possible.

The educationally handicapped students had been tested previ- 
ously as a part of their activities in the Special Education work- 
study training program. The regular students were asked to take 
part in the present study on the basis that they would be participat- 
ing in research. They were told that the research was being carried 
out in several schools in the city, that it involved a simple task, and 
did not take long. There were no refusals. 

For each S a quantitative (TA) and a qualitative (Q) score was
derived on the PMT in accordance with Porteus’ (1959) scoring
systems. The TA score is based on both the number of trials re-
quired and the number of mazes successfully completed. The Q
score is based on such things as frequency or extent of cutting
corners, crossing lines, taking a wrong direction, and lifting the
pencil from the paper. The two measures are negatively related and
supposedly tap different aspects of behavior (Porteus, 1942).² Pre-
sumably TA is more a measure of what an individual can do, while
Q is more a measure of how, or the way, the individual does it. TA
is, then, supposedly more a measure of ability as such, Q being more
a measure of the way this ability is expressed. As a measure of such
features of behavior as impulsivity or carefulness, the Q score is
felt by Porteus to relate to an individual's social and general ad-
justment.

For purposes of further analysis data were also obtained on socio-
economic status and intelligence test performance. In order to have
at least some assessment of socioeconomic level on most Ss, the fol-
lowing two measures were used. The dwelling unit of each S was as-
signed a monetary value based on census data for the City of
Lansing, Michigan (U. S. Department of Labor, 1960). Assignment
consisted of first locating each S's home address on a base map of
the city and then assigning to that dwelling unit the average mone-
tary value of all units on that particular block. House value was
not determined where a S lived outside the city limits or on blocks
with fewer than six dwelling units, nor where a recent address was
unobtainable. For those Ss whose father's occupation was specifi-
cally determinable, an occupational classification was assigned.
Five categories of classification were used, ranging from Profes-
sional to Unskilled, in accordance with the Dictionary of Occupa-
tional Titles (U. S. Department of Labor, 1955).

2 In previous research, correlations from −.22 to −.44 have been reported
between TA and Q on the PMT. In the present study the correlation for the
combined samples is −.40 (N = 225, p [illegible]).

Intelligence test data were obtained for as many of the Ss as pos-
sible. In the case of the regular students, only language and non-
language IQ scores on the California Test of Mental Maturity 
(CTMM) could be obtained. As is customary, the CTMM had 
been administered in groups on a school-wide basis. For most Ss 
the test had been taken within the last year. In the case of the edu- 
cationally handicapped students, WAIS full-scale IQ scores were 
obtained. These students had not ordinarily taken the CTMM, and 
most of them had been given an individual WAIS as a part of the 
procedure for admission to the Special Education work-study train- 
ing program. 

Results—Correlation of PMT and IQ scores. It was possible to 
obtain CTMM IQ scores on 145 of the regular students, and WAIS 
IQ scores on 45 of the educationally handicapped students. Table 
1 shows the correlations between intelligence test and PMT scores. 
There are low positive significant correlations between intelligence 
test score and TA score on the PMT for both samples. And there is 
a low negative significant correlation between Q score on the PMT 
and nonlanguage IQ score on the CTMM for the regular Ss. For 
practical purposes the magnitudes of the correlations are negligible, 
and they appear to agree with previously reported values summa- 
rized by Porteus (1959). 


TABLE 1
Spearman Rank Correlations between PMT and Intelligence Test Scores

                                        PMT Score
Sample          IQ Test Score          TA         Q
Regular         CTMM NL               .23**     −.25**
                CTMM L                .22**      .11
Handicapped     WAIS Full Scale       .34*       .08

*p < .05.
**p < .01.

The regular Ss’ median language IQ score on the CTMM was 109
(range 58-137) and their median nonlanguage IQ score was 110
(range 62-145). For the educationally handicapped Ss, median full-
scale IQ score on the WAIS was 74 (range 55-90). On the basis of a
median IQ score of 74, the educationally handicapped group would
not be considered representative of the mentally retarded.

Socioeconomic status. It was possible to determine father's occu- 
pation for all of the regular and 54 of the educationally handicapped 
Ss, and house value for 123 of the regular and 33 of the education-
ally handicapped Ss. Analysis of the paternal occupation data in
Table 2 gives a χ² value of 20.96, df = 4, p < .001. A similar analy-
sis on the house value data gives a value of 16.8, df = 4, p < .01.
From inspection of Table 2, it is clear that the educationally handi- 
capped Ss have more fathers in lower occupational levels and fewer 
in higher occupational levels than do the regular Ss, and that they 
more frequently live in homes of lower monetary value. 
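
The occupational-level comparison can be sketched as a Pearson chi-square test of independence on a 2 x 5 table. The counts below are the occupational-level entries as they can be read from Table 2, with the fifth entry for the handicapped group taken as 11 so that the row sums to the 54 Ss reported. That reading, and the function name, are assumptions rather than anything the authors state.

```python
def pearson_chi_square(table):
    """Return (chi_square, df) for a rectangular table of observed counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    chi_square = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence of sample and occupational level.
            expected = row_totals[i] * col_totals[j] / grand_total
            chi_square += (observed - expected) ** 2 / expected
    df = (len(table) - 1) * (len(table[0]) - 1)
    return chi_square, df

# Occupational levels I through V (assumed reading of Table 2).
observed = [
    [50, 20, 40, 24, 20],  # regular Ss (N = 154)
    [3, 5, 16, 19, 11],    # educationally handicapped Ss (N = 54)
]
chi_square, df = pearson_chi_square(observed)
print(f"chi-square = {chi_square:.2f}, df = {df}")
```

With these counts the statistic comes out at about 20.9 with 4 degrees of freedom, close to the 20.96 reported; the small difference may reflect rounding or a misread cell.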

Comparison of samples on PMT performance. As can be seen
from the summary data presented in Table 3, the educationally
handicapped group scores significantly lower than the regular group
on TA. At the same time there is a fair amount of overlap in the TA
score distributions of the two groups. Examination of Table 3 shows
that both distributions tend to be negatively skewed, and in the case 
of the regular Ss the distribution reflects the low ceiling of the scale. 
According to Porteus a 14 year TA “can probably” be taken as rep- 
resentative of the average person in the general population. As 
compared to the average of the general population then, the regular 
group is comparatively bright and the median TA of the educa- 
tionally handicapped group places them well within a “normal” 
range. 


TABLE 2
Distribution of Regular and Educationally Handicapped Ss by Paternal
Occupational Level and by House Value

                     Occupational Level                   House Value
Sample            I    II   III   IV    V        15+  14-13  12-11  10-9   8-5
Regular          50    20    40   24   20         22     24     24   [?]   [?]
Handicapped       3     5    16   19   11          3      3      2   [?]   [?]

Note.--Occupational level I indicates Professional, Technical, Managerial, and Self-employed;
[...]; Service. House Value is [...]. Entries shown as [?] are illegible in the source.

TABLE 3
Distribution of Regular and Educationally Handicapped Ss by
Test Age Score on PMT

                              Test Age on PMT
Sample          7   8   9  10  11  12  13  14  15  16  17    Mdn     U      p
Regular         0   0   0   1   0   2   3  13  31  69  35   16.4
Handicapped     5   1   5   2   0   2   8  17  17  13   1   14.4   1955    [?]

Note.--Test Age to be read 7.0 or 7.5, 8.0 or 8.5, etc. Entries are number of Ss. Medians are based on original
score distributions. Significance of U value determined by two-sided test. The entry shown as [?] is illegible in the source.


In terms of Q score, the educationally handicapped sample scores 
significantly higher than the regular sample. As shown by the sum- 
mary data in Table 4, there is also some overlap of the two distri- 
butions on this measure. 

It perhaps appears paradoxical that the handicapped students'
TA performance places them within the "normal" range while, in
terms of the Q measure, their performance is in keeping with the
label "retarded" (Porteus, 1959). The apparent paradox between
the two findings, however, is resolvable on the basis of previous re- 
search. In studies where mean TA is about average or higher and 
mean Q is in the range 29-50, the Ss are likely to be called such 
things as “lazy,” “slow,” “illiterate,” “confused,” or “sloppy” (Doc- 
ter and Winder, 1954; Fooks and Thomas, 1957; Porteus, 1959; 
Wright, 1944). More likely than not they will also be found by their 
teachers to have unsatisfactory behavior in school as shown by “in- 
different effort and undependable work in completing assignments” 
(Porteus, 1959). Essentially their difficulties are not a function of 
“subnormal intelligence,” but are a function of motivation or other 
personal characteristics. 

It is also instructive to apply the cutting score on the Q measure 
which has previously been found to discriminate between such so- 
called dependable and undependable individuals to the Q score dis- 
tributions of the regular and handicapped students under study. 
Using a cut score of 29, 75 per cent of the educationally handi- 
capped Ss and 36 per cent of the regular Ss score above this value. 
These figures fit fairly well those previously found; in general 20- 
30 per cent of dependable individuals and 70-80 per cent of unde-
pendable individuals score above this value. In spite of this gross
discrimination, however, it is also clear that using such a cut score
would result in many errors of classification.

Discussion. If the educationally handicapped Ss were retarded 
and representative of the population of all mentally retarded in- 
dividuals, they should have scored in the “mentally deficient” range 
on the tests used and come somewhat proportionately from all so- 
cioeconomic levels (Masland, Sarason, and Gladwin, 1958). If, on 
the other hand, they were retarded and representative of the “Gar- 
den Variety” retarded population, instead of all mentally retarded 
individuals, they still should have scored in the retarded range on 
the tests, but come disproportionately from the lower socioeconomic 
levels (Sarason, 1953). The results show that neither set of condi- 
tions is sufficiently well met. While a disproportionate number of the 
handicapped Ss do tend to come from the lower socioeconomic lev- 
els, almost every S scores above the “mentally retarded” range on 
at least one of the tests. 

It appears likely that unfavorable environmental conditions are 
largely responsible for the inefficient academic performance dis-
played by the majority of educationally handicapped Ss. In turn, it
appears that their relatively poor everyday school performance 
was mistakenly reacted to and judged by a number of school per- 
sonnel to be due to intellectual deficiency per se. The relatively 
high TA performance in conjunction with the relatively poor Q 
score performance on the PMT, the borderline IQ score perform- 
ance, and the generally lower socioeconomic level of the educa- 
tionally handicapped sample are all consistent with what previous 
investigators have found for so-called disadvantaged or deprived 
Ss (Havighurst and Janke, 1944; Masland, Sarason, and Gladwin, 
1958; McCandless, 1964; Porteus, 1959; Sarason, 1953). Such Ss 
are characterized as tending to have values, interests, and habits 
that often make them misfits in the regular school classroom. They 
are likely to be indifferent, frustrated, or bored by the usual school 
studies and activities. From kindergarten on, they may have been 
unduly inattentive or distractible in the classroom. Frequently they 
started off in school failing and continued to experience failure in 
their school studies. Often, unless they are placed into some kind of 
special program, they end up being dropouts or are simply given 
“social promotions” from grade to grade. Although the extent to 
which the educationally handicapped Ss under study precisely 
match such descriptions is unknown, the majority of them appear
closer to being representative of such individuals than they are of 
the mentally retarded. Accordingly, the behavior they show, on the 
basis of which they have been selected for special attention by the 
school, is likely to have been as much a function of their personal- 
social adjustment or their intellectual efficiency as of deficiencies in 
their intellectual ability. 

This way of viewing the results is also applicable, of course, in 
the case of those Ss in the regular student group who had a rela- 
tively high TA performance on the PMT, possibly in conjunction 
with a relatively poor Q performance, a borderline or lower IQ 
score, and who came from lower socioeconomic levels; yet are not 
in the Special Education work-training program. From what we can 
tell, the appropriate answer would seem to be: they are less likely 
to be in the program because for some reason their personal-social 
characteristics in the school situation are acceptable. Apart from 
this surmise, however, a more dependable answer will require fur-
ther research.

With respect to the results reported by Gambaro and Schell 
(1966), they are probably best understood as also reflecting the re- 
lationship between how a S performs on the PMT and his personal- 
social characteristics. That is, those student-trainees who were apt 
to be hired were not just likely to be more intelligent, but were 
likely to be more cooperative, more responsible, careful in their 
work, and in general more trustworthy. This way of viewing their 
results would also be supported by the moderately-high positive
correlation they found between PMT performance and ratings of
personal effectiveness.


UTILIZATION OF THE WECHSLER ADULT
INTELLIGENCE SCALE (WAIS) IN PREDICTING 
SUCCESS WITH LOW AVERAGE 
HIGH SCHOOL STUDENTS 


STEVEN G. GOLDSTEIN 
Purdue University 
AND 
CHARLES T. LUNDY 
Gannon College 


Problem. This study was undertaken to determine the efficacy of 
using the full (11 subtests) WAIS to predict academic success (i.e.,
graduation) for the low average and below average high school stu- 
dent. 

Earlier studies. Although the acquisition of the verbal process 
and strong achievement needs seem to explain the success of the 
average and above average student as measured by standardized 
intelligence tests, it does not tap the abilities associated with suc- 
cess, in terms of school completion, of low and below average stu- 
dents. Research has focused on the mental retardate, the delinquent, 
or deviate, and poor readers. Newman and Loos (1955) found that 
mental defectives score significantly higher on the Performance 
Scale of the Wechsler Intelligence Scale for Children (WISC) than
they scored on the nonperformance (or Verbal) portions. Burke and 
Bruce (1955) reported that poor readers (defined here as two grade 
levels below their grade placement as measured by standardized 
reading tests) were lower on the Information, Arithmetic and Cod- 
ing subtests than were good readers who, in comparison with the 
poor readers, were significantly higher on the Similarities subtest. 
It was further noted that, compared with good readers, the poor 
readers were higher on Picture Arrangement, Block Design, and 
Comprehension subtests than on any other of the subtests. Similar
results were found by Blank (1958) who showed that delinquents 
were usually higher on the Performance subtests requiring visual- 
motor co-ordination, and speed factors such as Object Assembly 
and Picture Arrangement than on Verbal tasks. 

The present study attempted to identify those variables on both 
Verbal and Performance portions of the WAIS which contribute to 
the accurate prediction of school completion for individuals classi-
fied as below average in scholastic performance. There was no dif-
ferentiation among the group according to age or sex. Since the
group was statistically homogeneous in terms of age, no analysis
utilizing this factor as a variable was undertaken. Although Dailey
and Winitz (1962) found significant differences on the WISC 
among children with a mean chronological age of 63 months (fa- 
voring the girls, who were higher on the Performance Scale IQ and 
on the Similarities and Coding subtests), Levinson (1963) reported 
no significant differences on WAIS subtests between a group of 30 
males and 33 females with mean education levels of 15.64 and 15.47 
respectively. These educational levels were comparable with the 
age group included in this study. 

Procedure. Subjects. Sixty-eight WAIS profiles were selected from 
a population of 200 profiles of high school students all of whom
had been referred because of poor academic performance. The
profiles finally selected were all from low average or below average
ability students (IQ ≤ 102). The remaining 132 profiles all had IQ's
greater than 102. The second selection criterion, that of academic
success, was determined by receipt of a high school diploma.

Statistical analysis. Sequential multiple regression analyses 
(Fimple, 1965) employed the grade point index (GPI) of each sub- 
ject as the criterion value. While the authors were fully aware of re- 
cent criticisms concerning this variable (Chansky, 1964) they felt it 
was probably one of the stronger criteria as, at least in this case, it 
would contribute heavily to the definition of academic success used
here.

The regression analyses considered individual subtest scaled 
scores as predictor variables. A separate regression analysis (called 
a sequential analysis) was performed after each predictor variable 
deletion along with an analysis of regression (ANOR) for each set 
of variables prior to deletion. The significance value for deletion 
was set at the .99 level of confidence. A preliminary analysis of the
data yielded the means and standard deviations shown on Table 1. 
The criterion had a mean of 1.2996 or just above an academic aver-
age of “D.”


TABLE 1
Means and Standard Deviations for Predictor Variables
(WAIS Subtest Scaled Scores)

Variable               Mean    SD
Information            7.67    1.86
Comprehension          8.60    2.39
Arithmetic             7.75    2.13
Similarities           8.23    2.34
Digit Span             8.04    1.95
Vocabulary             6.86    1.92
Digit Symbol           9.01    1.83
Picture Completion     8.58    2.21
Block Design           8.33    2.42
Picture Arrangement    8.82    2.43
Object Assembly        8.64    2.91


The seventh sequential analysis appeared to be one maximizing 
the variability accounted for and gave a four predictor regression 
equation. Tables 2 and 3 show the ANOR for this point in the 
analyses as well as the deletion indices. 


TABLE 2
Summary Table for ANOR at Seventh Regression Sequence

Source        SS         df    MS       F
Regression     1.5071     4    .3768    2.60    F.95(4,60) = 2.52
Error          9.1288    63    .1449
Total         10.6359    67


TABLE 3
Deletion Indices Showing Standard Error, t Ratio and F Ratio
with Significance Values

                       Standard
Variable               Error      t Ratio    F
Similarities           .0213      1.6788     2.82    F.90(1,60) = 2.79
Digit Span             .0240      1.5141     2.29    not significant
Picture Arrangement    .0226      2.3324     5.44    F.95(1,60) = 4.00
Object Assembly        .0182      1.9499     3.80    F.90(1,60) = 2.79

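
The arithmetic behind the ANOR summary and the deletion indices can be checked with a short sketch. It takes the sums of squares, degrees of freedom, and three of the t ratios as reported, and assumes only the standard identities MS = SS/df, F = MS(regression)/MS(error), and F = t squared for a test on a single predictor.

```python
# Sums of squares and degrees of freedom as reported for the seventh regression.
ss_regression, df_regression = 1.5071, 4
ss_error, df_error = 9.1288, 63

# Mean squares and the overall F ratio.
ms_regression = ss_regression / df_regression
ms_error = ss_error / df_error
f_ratio = ms_regression / ms_error

# For a single predictor, the deletion F equals the square of its t ratio.
t_ratios = {
    "Digit Span": 1.5141,
    "Picture Arrangement": 2.3324,
    "Object Assembly": 1.9499,
}
f_from_t = {name: t ** 2 for name, t in t_ratios.items()}

print(f"MS regression = {ms_regression:.4f}, MS error = {ms_error:.4f}, F = {f_ratio:.2f}")
for name, f_value in f_from_t.items():
    print(f"{name}: F = {f_value:.2f}")
```

The squared t ratios reproduce the tabled F values for these three subtests, and the overall F of 2.60 follows from the reported sums of squares.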
Discussion. The results support the contention that the subjects
were not dependent upon verbal cues for success. Except for Simi- 
larities, all of the verbal subtests were deleted almost immediately. 
However, it must be kept in mind in the discussion that follows that 
all of the subject mean scores were below the standardization mean 
for any one subtest. 

It appears that the best predictors of success with the low- 
achieving high school student are the Similarities (S), Digit Span 
(D. Sp.), Picture Arrangement (P.A.), and Object Assembly (O.A.) 
subtests. The subjects used for this study were relatively more able
to discriminate and assemble common objects and situations from
their component parts than were their nonsuccessful contempo- 
raries. They also appear to possess better rote memory for a series 
of digits. 

Of all the predictors, the Similarities subtest remained strongest 
to the last step. In light of previous results stressing the importance 
of verbal facility in academic achievement, this result was not ex- 
pected. Rather, it was felt that the Comprehension subtest would be 
a stronger predictor and that this, in combination with P.A., would 
indicate a “social manipulation” variable. But if verbal facility was 
the predominant characteristic, the Vocabulary subtest, which was 
the first to be deleted, would have held up longer than it did. Realiz- 
ing that the Similarities score was not a result of verbal fluency per 
se, it probably can best be ascribed to facility to make higher order
generalizations but only to a limited extent and to a degree re- 
stricted by verbal fluency. 

It would seem apparent from the results that in dealing with the 
low-achieving high school student, one may actually be concerned 
with at least two groups. The group specifically of interest in this 
study would seem to have benefited more from an academic experi- 
ence, should its members have been identified earlier and been 
placed in an intensive program dealing with remedial work in the 
areas of both verbal and written expression. 

Summary. Sixty-eight high school students who had been referred 
for testing because of academic difficulties and who eventually were 
graduated were used as subjects in this study. WAIS profiles were 
obtained on all subjects, and sequential regression analyses were 
performed on all profiles with the 11 subtests of the WAIS as pre-
dictor variables and with the subjects’ Grade Point Index as the
criterion. The best predictor equation of GPI was made up of the
Similarities, Digit Span, Picture Arrangement, and Object As- 
sembly subtests. It appeared from the results that much of the 
scholastic difficulty encountered by the subjects must be due more 
to a lack of fluency in verbal expression than to difficulties in the 
more abstract generalization realm. 


SOCIOMETRIC FACTORS RELATED TO
SUBSEQUENT SUCCESS IN A NURSING PROGRAM¹


DANIEL V. CAPUTO 
Queens College (CUNY) 

GEORGE PSATHAS AND JON M. PLAPP 
Washington University 


The present study is concerned with testing a number of hypothe-
ses dealing with the relation between choices received and given on
several dimensions of preferential sociometric choice, and later suc-
cess in nursing school. The major hypothesis is that students’ scores
on a preferential sociometric test given early in training will be re-
lated to their subsequent success or lack of success in nursing
school. 

First, the present study assesses the predictive validity of socio- 
metric choice scores in relationship to success and failure in a train- 
ing program rather than the concurrent validity of this method. If, 
as Epperson (1963) indicates, peer support is an important deter- 
minant of academic success the predictive validity model would be 
a stronger test of this hypothesis than would the concurrent validity 
model. 

The predictive utility of peer-evaluation measures has been
tested against various criteria of success in diverse situations
(Gronlund and Holmlund, 1958; Hollander, 1954; Kuhlen and Col-
lister, 1952; Ullman, 1957; Wherry and Freyer, 1949; Williams and
Leavitt, 1947). In view of the ease of administration and overall
simplicity of the sociometric method, it would be of importance to
determine its efficacy as a selection device early in nursing training
as well.

1 This paper is a partial report of a research project, “Role Differentials and
Nursing Ideology,” John 6 Stern and Albert E. Wessen, Co-Principal In-
vestigators, supported by Public Health Service Research Grant No. NU
00050 from the Division of Nursing, Bureau of State Services-Community
Health. A portion of the computation was done with support provided by the
Washington University Computer Center under National Science Foundation
Grant G-22296.

Since intelligence or ability is strongly related to subsequent suc- 
cess in an academic situation, a low ability subsample of Ss was in- 
tensively studied to assess some of the questions proposed in this 
study. The low ability group, while serving as a control for intelli- 
gence, also represents a marginal group for whom the effects of peer 
support or rejection on success in training ought to be maximal (cf. 
Epperson, 1963). 

In addition to considering the predictive value of quantity or 
number of sociometric choices as related to success or failure in 
training, the predictive value of “quality” of sociometric choice is
also determined, employing the low ability group. Hall and Willer- 
man (1963) showed that the qualitative characteristics of stu- 
dents’ associates were related to academic performance. 

It would be of interest to discover whether such a relationship 
would hold for the “associations” between peers as revealed in 
preferential sociometric data early in training. Accordingly, the in- 
tellectual ability, sociometric status and subsequent success or fail- 
ure of peers sociometrically choosing successful and unsuccessful 
low ability Ss were compared. Further, students chosen sociometri- 
cally by the low ability Ss themselves were assessed on the qualita- 
tive dimensions noted. Thus, it was expected both that successful 
members of the particularly vulnerable “low ability” group re- 
ceived peer support from qualitatively superior peers and that they, 
in turn, were sociometrically attracted to such peers. In addition, 
reciprocity (or mutuality) of sociometric choice ought to 
discriminate successful from unsuccessful students. 

Another aspect of sociometric choice considered in the present
study is the relationship between non-receipt of positive socio- 
metric choices as compared with receipt of negative sociometric 
choices and later success or failure in the training situation. 

Further, Hollander (1956) in a study of naval officer candidates, 
pointed out that “friendship” or simple popularity differentially af- 
fected various categories of peer nominations to a significant de- 
gree. A related problem concerns whether diverse positive socio- 
metric categories differentially predict to success-failure criteria. In 
the present case, a broad “like best” category and more circum-
scribed “teammate” and “roommate” categories were compared
with one another and finally, with the sum of all three of these posi- 
tive categories. 


The hypotheses tested were as follows: 


“Quantity” of positive or negative choices 

1. The absolute number, or “quantity,” of positive or negative 
choices received by a student early in training is related to his 
subsequent success in school. Thus, the successful student should, 
in general, receive a greater number of positive choices than the 
unsuccessful student. The predicted relationships should hold 
when the criterion for lack of success is (a) dropping out of nurs- 
ing school or (b) appearing on the probation or warning lists. 

2. It is expected that the unsuccessful student receives more nega- 
tive choices than the successful student, again employing the 
same two criteria of lack of success. 


“Quality” of the chooser 


Certain characteristics of the person who chooses a student may 
be related to the success, or lack of it, of the chosen student. This
aspect of sociometric choice may be referred to as the “quality” of
the chooser. 

3. a It was predicted that successful low ability Ss (stayins) re- 
ceive positive sociometric choices from peers who are of higher 
ability, higher social class and who are stayins themselves than 
do unsuccessful low ability Ss (dropouts). 

b Low-ability dropouts should be chosen negatively by peers of 
higher “quality” than is the case for low-ability stayins. The 
performance of the low-ability student would be expected to 
suffer most if he were chosen negatively by a peer of high 
“quality,” since negative selection by such a peer would serve 
to deprive him of support by those most able to provide it. 


“Quality” of the chosen 

The “quality” of those peers chosen by the student, particularly
by the student of low ability, is related to the success of the choos- 
ing student himself. 

4. a It is expected that peers chosen by low-ability successful stu-
dents will be of higher “quality” than peers chosen positively 
by low-ability unsuccessful students. Such selection might re- 
flect the higher aspiration level of the successful low ability 
student. 

b Conversely, the low-ability student who is performing suc- 
cessfully should select on negative criteria peers of lower 
“quality” than the peers selected on negative criteria by the 
low-ability unsuccessful student. 

Mutuality of choice 

5. Although the fact of being chosen positively by peers is expected
to favor successful performance on the part of the low-ability
student, this effect is likely to be strongest when mutual or re-
ciprocated choices are involved. Successful low-ability students
are therefore expected to have a greater number of mutual
positive sociometric choices than are unsuccessful students of low
ability.
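
The notion of a mutual choice in hypothesis 5 can be sketched in a few lines. The sketch assumes the scoring rule the study applies in its results: a choice counts as mutual when two students name each other, in any of the three choice positions, within the same sociometric category. The function name, the data layout, and the students "A" to "D" are hypothetical.

```python
def mutual_choices(choices, category):
    """Return the set of reciprocated pairs for one sociometric category.

    choices maps each student to {category: [up to three named classmates]}.
    A pair is mutual when each member names the other in that same category.
    """
    pairs = set()
    for chooser, by_category in choices.items():
        for chosen in by_category.get(category, []):
            # Look up whether the chosen student named the chooser in return.
            if chooser in choices.get(chosen, {}).get(category, []):
                pairs.add(frozenset((chooser, chosen)))
    return pairs


# Hypothetical ballots for the "roommate" category only.
choices = {
    "A": {"roommate": ["B", "C", "D"]},
    "B": {"roommate": ["A", "D", "C"]},
    "C": {"roommate": ["D", "B", "E"]},
    "D": {"roommate": ["E", "C", "F"]},
}

pairs = mutual_choices(choices, "roommate")
print(sorted(sorted(pair) for pair in pairs))
```

Storing each pair as a frozenset keeps A-B and B-A from being counted twice.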

Method. The strategy of the method employed is to demonstrate 
that sociometric status is predictive of later lack of success in nurs- 
ing training, especially among students of marginal ability. 

Subjects. The subjects for this study were the 79 students forming 
the entire 1962-63 freshman class of a diploma school of nursing in 
St. Louis. Subjects were divided into two samples. The first con- 
sisted of all 79 freshman students. The second sample, the low- 
ability group, was a subgroup of the first and consisted of those 23 
students who scored below the class median on two measures of 
ability. 

Ability measures. Prior to the admission of each student to the 
nursing school, her rank in her own high school class (HSR) and her 
composite score on the Nursing Admission Test (NAT) of the 
Scholastic Testing Service were determined. The low ability group 
consisted of 23 students who fell below the median of the group on 
both of these measures. Scores of all students on the Otis Intelli- 
gence Test (Gamma AM) were also obtained prior to admission. 
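
The selection rule for the low ability group (below the class median on both ability measures) can be sketched as follows. The student identifiers and scores are hypothetical, and the sketch assumes both measures are expressed so that a higher number means higher standing; the source does not state how HSR was scaled.

```python
from statistics import median


def low_ability_group(scores):
    """Return the IDs of students below the class median on both measures.

    scores maps student ID -> (hsr, nat). Assumption: higher values mean
    higher standing on both measures.
    """
    hsr_median = median(hsr for hsr, _ in scores.values())
    nat_median = median(nat for _, nat in scores.values())
    return sorted(
        sid
        for sid, (hsr, nat) in scores.items()
        if hsr < hsr_median and nat < nat_median
    )


# Hypothetical class of six students.
scores = {
    "s1": (40, 55),
    "s2": (70, 80),
    "s3": (35, 90),
    "s4": (60, 50),
    "s5": (90, 60),
    "s6": (30, 45),
}

low_group = low_ability_group(scores)
print(low_group)
```

Requiring both conditions is what makes the subgroup smaller than half the class: a student below the median on only one measure is excluded.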

Sociometric measures. The sociometric test was administered to
the entire group of Ss approximately two months after the admis-
sion of the students to the school. Each student was asked to name
three classmates for each of three positive sociometric categories
(“roommate,” “teammate,” “like best”) and for a single negative
category (“like least”). A total of twelve names of fellow students
was therefore called for from each student, though a particular 
student could be named for more than one category. The three 
positive categories were randomly arranged in terms of their order 
of appearance on the form, but the negative category was always 
last. 
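
The scoring of such a test can be illustrated with a small sketch that tallies the choices each student receives. It assumes unweighted counting (first, second, and third choices each count once) and a total positive score formed by summing the three positive categories; the ballots and student labels are hypothetical.

```python
from collections import Counter

POSITIVE = ("roommate", "teammate", "like_best")
NEGATIVE = ("like_least",)


def choices_received(ballots):
    """Count choices received per student in each category, plus total positive."""
    received = {cat: Counter() for cat in POSITIVE + NEGATIVE + ("total_positive",)}
    for ballot in ballots.values():
        for category, names in ballot.items():
            for name in names:
                received[category][name] += 1
                if category in POSITIVE:
                    received["total_positive"][name] += 1
    return received


# Hypothetical ballots from two students; each names three classmates per category.
ballots = {
    "A": {"roommate": ["B", "C", "D"], "teammate": ["B", "D", "C"],
          "like_best": ["B", "C", "D"], "like_least": ["E", "F", "G"]},
    "B": {"roommate": ["A", "C", "D"], "teammate": ["A", "C", "E"],
          "like_best": ["A", "C", "D"], "like_least": ["E", "F", "D"]},
}

tally = choices_received(ballots)
names_per_ballot = sum(len(names) for names in ballots["A"].values())
print(tally["total_positive"]["C"], tally["like_least"]["E"], names_per_ballot)
```

Each ballot carries twelve names (three in each of four categories), and the same classmate may appear under more than one category, as the procedure allows.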

Measures of the success criterion. Two measures of lack of suc- 
cess in the nursing program were employed: (a) dropping out of 
school in the first or second year of training; (b) being placed on 
the “probation” or “warning” lists of the school during the first 
quarter of the freshman year. To be placed on the probation list a 
S had to have an overall first quarter average of below C or be fail- 
ing in one subject. To be placed on the warning list a student had to 
have one grade below C. The number of Ss who dropped out of 
school during the two-year period was 27. The number of Ss whose 
names appeared on the probation or warning lists was 20. 

Measures of quality. Three measures of the “quality” of peers 
sociometrically choosing members of the low ability subsample and 
peers chosen by low ability Ss were obtained: (a) their ability 
levels, as indicated by their median scores on the combined HSR 
and NAT; (b) their social class ratings, made on the basis of ques- 
tionnaire reports by the students of the educational and occupa- 
tional status of their parents (Hollingshead, 1957; Hollingshead 
and Redlich, 1958), and (c) whether or not they were dropouts. 

Results. In general, analysis of the data revealed that the hy- 
pothesis predicting that “quantity” of positive choices received 
would be related to success was supported, while the hypothesis 
predicting that success would be related to the “quality” of choos- 
ing and chosen peers was only partially supported. 

The reliability of the sociometric method, as employed in this
study, was determined by correlating each of the sociometric cate-
gories with the sum of all positive categories (TP) (see Gronlund,
1959, Ch. 5). Pearson r was employed for comparisons involving
the positive categories while the phi coefficient was employed for
the negative (like-least) category. Scores for first, second and third
choices were simply summated within each category. Reliability
was indicated at the .0001 level or better for each of the positive
categories. The phi coefficient between like least (LL) and total
positive (TP) indicated significant reliability at the .025 level.
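
For readers unfamiliar with the phi coefficient, a minimal sketch follows. It uses the textbook formula for a 2 x 2 table and the identity chi-square = N times phi squared (1 df). The cell counts are hypothetical, and the source does not say how the like-least and total-positive scores were dichotomized.

```python
import math


def phi_coefficient(a, b, c, d):
    """Phi for the 2 x 2 table [[a, b], [c, d]]."""
    denominator = math.sqrt((a + b) * (c + d) * (a + c) * (b + d))
    return (a * d - b * c) / denominator


def chi_square_from_phi(phi, n):
    """Chi-square (1 df) implied by phi in a sample of n cases."""
    return n * phi ** 2


# Hypothetical dichotomized counts for 79 students:
# rows = received any like-least choice (yes / no),
# columns = total positive choices above / below the median.
a, b, c, d = 10, 20, 30, 19
n = a + b + c + d

phi = phi_coefficient(a, b, c, d)
chi_square = chi_square_from_phi(phi, n)
print(f"phi = {phi:.3f}, chi-square = {chi_square:.2f}")
```

A negative phi in this layout means that students who received like-least choices tended to fall below the median on total positive choices.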

Quantity of positive or negative choices. The results under hy-
pothesis 1a, that dropouts are less attractive to their peers early in
the training program, showed that the 27 students who dropped out
of the program over the course of two years did not receive any
fewer sociometric choices for roommate (RM), teammate (TM),
like best (LB), or total positive (TP) categories than did the 52


(p < .01). The mean number of positive choices, standard devia-
tions and results of chi-square analyses are shown in Table 1.

no significant difference, employing the entire sample


TABLE 1
[Table body not legible in the source. Legible labels: Probation-Warning (N = 20); Other Peers; Mean Choices; Standard Deviation; Chi-squared; Significance level.]

which is not statistically significant for 1 df. The disparity between
means and standard deviations for these two groups was due to the
presence in the dropout group of two Ss who received
fifty negative choices each.)

When placement on the probation-warning list was employed as
the criterion of lack of success, statistically significant results in the
predicted direction were obtained (see Table 1). The 20 unsuccess-
ful Ss received a mean of 7.35 negative (like least) choices as com-
pared with 1.56 for the successful students,
status, the probation-warning list appeared to be a criterion of
failure which was contaminated by the factor of intellectual ability. 
Accordingly, low-ability, dropout-stayin groups were employed for
testing both hypotheses 1 and 2 and the "quality" hypotheses. 

Of the 23 Ss in the low ability group, 12 dropped out within the 
two year period covered by the study, while 11 remained. The 
mean Otis IQ score for the low ability stayins was 111.00 and for 
the low ability dropouts, 110.25 (t = .44), a difference not statisti- 
cally significant. 

The results under hypotheses 1 and 2 for the low ability group 
are shown in Table 2. The mean number of choices for each of the 
positive sociometric categories significantly discriminated between 
the low ability dropouts and stayins but the mean number of nega- 
tive (like least) choices did not. In addition, for this low ability 
group the positive sociometric categories predicted equally well to 
the criterion, although TM choice, again, was slightly more effec-
tive. (The mean and standard deviation for the like least scores of
the dropout group were spuriously raised by the presence of a sub-
ject in this group who received 50 like least choices.) Among this
[illegible] who, from an intellectual standpoint, are most
prone to failure, continuance seems to be [illegible]
attractiveness [illegible].

TABLE 2
Choices Received by Low Ability Dropouts and Stayins
by Sociometric Category

                         Room-    Team-    Like     Total       Like
                         mate     mate     Best     Positive    Least

Dropouts (N = 12)
  Mean Choices           2.70     2.17     2.40      7.50        5.50
  Standard Deviation     2.18     3.33     1.93      5.50       14.75
Stayins (N = 11)
  Mean Choices           4.20     3.45     3.63     11.20        1.45
  Standard Deviation     ?.12     2.25     ?.15      4.74        1.81
[row label illegible]    22       30       38       45           ?
Significance Level       .025     .01      .025      .05         NS

“Quality” of the chooser. Hypothesis 3a predicted that for each
of the positive sociometric categories the ability and social class of
those choosing low ability stayins ought to be higher than of those 
choosing low ability dropouts. A greater number of Ss who choose 
low ability stayins positively should themselves be stayins; and,
conversely, fewer stayins should choose low ability dropouts. Hy- 
pothesis 3b predicted that the low ability dropouts ought to be se- 
lected negatively (like least) by peers of higher quality than the 
low ability stayins. The only significant difference in this compari- 
son indicated that as predicted, those who chose the low ability drop- 
outs for first choice on the teammate category were of a lower social 
class (mean: 3.86; SD: .69) than those who chose the low ability 
stayins (mean: 3.00; SD: 1.22; U value was eight indicating signifi- 
cance at the .025 level). All the comparisons were in the predicted 
direction except for those involving like least (LL) and social class. 

However, the N involved for certain of the sociometric categories 
was quite small, so that hypothesis testing may not have been ade-
quate. In order to test hypotheses 3a and b more extensively, data
involving all 3 sociometric choices for each category, instead of first
choice alone, were employed. In this case, the ability scores, for ex- 
ample, of all peers choosing a low ability stayin as either first, sec- 
ond or third choice for roommate were compared by t-test with the 
ability scores of all peers choosing a low ability dropout for the 
same category, thereby increasing the N employed considerably. 
This method of handling sociometric data may be thought to violate 
the requirement of independence of data since a single peer may
choose three low ability Ss for the RM category; thus her scores 
may appear three times in the RM statistical analysis. However, 
Signorile and O’Shea (1965) challenge the contention that inde- 
pendence is violated as follows: 

“We are aware of no a priori arguments which would make the 
dependence of sequential sociometric choices any more acceptable
than their independence. Although individual effects may be de- 
pendent, which is problematic, the effects for a whole population 
may be random. Therefore, as sample size increases, there should be 
a tendency for dependence effects to disappear.” (p. 469) 

They recommend that dependence effects be ignored when the 
number of choices is small compared to sample size. 

The analysis of scores for each sociometric category for each of 
the “quality” indices yielded a significant difference only for the 
ability scores of those selecting the low ability Ss for like best (LB). 
This result was as predicted: the 39 peers choosing a low ability 
stayin for LB had a mean ability score of 70.46 (SD = 35.54),
while the 30 peers choosing a low ability dropout for like best had a
mean ability score of 84.07 (SD = 30.80), a difference statistically
significant at the .05 level for a one-tailed test (t = 1.67, df = 67).
(The smaller number indicates greater intellectual ability.) A simi-
lar trend (at the .10 level) was evident for the ability scores of
those choosing low ability Ss for teammate. In all other comparisons
for ability, social class and dropout-stayin status of choosers of low
ability Ss the results were in the predicted direction except for the
like least (LL) comparisons on ability and social class. The “qual-
ity” of choosers of low ability Ss, then, relates to the later success or
failure of these low ability Ss only to a minor extent, the “quality”
of choosers for positive sociometric categories (RM, TM, LB) being
of slightly more importance, perhaps, than the “quality” of choosers 
for the negative sociometric category (LL). 
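
The like-best comparison above can be re-derived from its summary statistics. The sketch assumes an ordinary pooled-variance t-test for independent groups, reads the stayin choosers' mean as 70.46, and treats the reported SDs as sample standard deviations; the source does not state which t formula was used.

```python
import math


def pooled_t(n1, mean1, sd1, n2, mean2, sd2):
    """Pooled-variance t statistic and degrees of freedom from summary statistics."""
    df = n1 + n2 - 2
    pooled_variance = ((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2) / df
    standard_error = math.sqrt(pooled_variance * (1 / n1 + 1 / n2))
    return (mean2 - mean1) / standard_error, df


# Peers choosing a low ability stayin versus a low ability dropout for like best.
t_value, df = pooled_t(39, 70.46, 35.54, 30, 84.07, 30.80)
print(f"t = {t_value:.2f}, df = {df}")
```

Under these assumptions the sketch reproduces the reported t = 1.67 with 67 degrees of freedom, which supports reading the first mean as 70.46.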

“Quality” of the chosen. Hypothesis 4a states that for the low 
ability group, stayins should themselves choose for first choice on 
positive sociometric criteria individuals demonstrating more de-
sirable characteristics than should the dropouts. Further, hypothe-
sis 4b states that the stayins should choose for first choice on the
negative sociometric criterion of like least, individuals with less de-
sirable characteristics than should the dropouts. These results were
not significantly discriminating except that those peers selected by
the stayins were of a significantly higher social class (mean: 3.38;
SD: 1.06) than the peers selected by dropouts (mean: 3.55; SD:
.69) for first choice roommate (RM). The U value in this case was
26 (p < .05). The converse was the case for the first-choice like-
least selections (hypothesis 4b); the means and standard deviations
for the peers selected by the stayins and dropouts in this case were
respectively: 3.36, 1.29; 3.50, .67. The U value was 18 (p < .01).
[illegible] statistically significant results were in accord with the
[illegible] but two of the comparisons were, again, in the
[illegible]

Tests of significance were again applied to the data for all three
[illegible]
results of these analyses in the assessment of hypotheses 4a and b.
Low ability stayins did select more frequently
than did low ability dropouts, peers of greater intellectual ability
[illegible], peers of higher social class for LB and RM.
These low ability stayins also chose peers of lesser ability for LL
and tended to choose for LL relatively fewer stayins than did low 
ability dropouts. 

Thus, the prediction that low ability stayins would choose, on 
positive sociometric criteria, peers of higher “quality” tended to be
supported more definitely in the case of ability and less so in the 
case of social class. The hypothesis that low ability stayins would 
view as objectionable peers who show less desirable characteristics 
(in comparison to those found objectionable by dropouts) tended to 
be supported in regard to ability but not in regard to social class. 
The proportion of stayins vs. dropouts chosen sociometrically by 
each of the low ability subgroups was discriminating only for the 
negative sociometric category.

Mutuality of choice. Hypothesis 6, that mutual attraction ought 
more frequently to occur for successful low ability students than for 
unsuccessful ones, was assessed by recording a mutual or reciprocal
choice if one S chose another S for first, second or third choice of a
particular sociometric category and was selected in return by that
S for first, second or third choice for that same category. For the
low ability dropouts, 12 reciprocal choices were possible, while for
the low ability stayins 11 were possible for each category. The
number of reciprocal choices actually received by dropouts and
stayins for each category were, respectively: RM: 9, 10; TM: 4, 7;
LB: 9, 9. None of the Fisher exact test analyses was statistically
significant although the data in each case were in the predicted 
direction. Mutual attraction, then, did not appear to be a significant
factor in later success in the training program. 
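
As an illustration only, the Fisher exact comparisons described above can be sketched as follows. The sketch assumes that the reciprocal-choice counts read 9 and 10 (RM), 4 and 7 (TM), and 9 and 9 (LB) for dropouts and stayins, out of 12 and 11 possible choices respectively, and that each possible choice is treated as an independent trial. The function name and the one-tailed direction (stayins more often reciprocated) are assumptions, not the authors' stated procedure.

```python
from math import comb

def fisher_one_sided(a, b, c, d):
    """One-tailed Fisher exact p for the 2x2 table [[a, b], [c, d]].

    Returns the probability, with all margins fixed, of a count of
    at least `a` in the upper-left cell (hypergeometric tail).
    """
    row1, col1, n = a + b, a + c, a + b + c + d
    upper = min(row1, col1)
    favorable = sum(comb(col1, k) * comb(n - col1, row1 - k)
                    for k in range(a, upper + 1))
    return favorable / comb(n, row1)

# Cells: stayins reciprocal, stayins not, dropouts reciprocal, dropouts not.
# The counts are a reading of the text (assumption), not verified data.
tables = {
    "RM": (10, 1, 9, 3),
    "TM": (7, 4, 4, 8),
    "LB": (9, 2, 9, 3),
}
p_values = {name: fisher_one_sided(*cells) for name, cells in tables.items()}
```

Under these assumed counts none of the three one-tailed probabilities falls below .05, which agrees with the statement that no comparison reached significance although each was in the predicted direction.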

Discussion. The quantity of sociometric choice, when employed 
over the whole range of 79 students, predicted to probation-warn- 
ing list placement but not to dropout-stayin status. The divergent 
results obtained through the use of the two criteria of lack of 
success, discontinuance and placement on probation-warning lists, 
require comment. Plapp, Psathas, and Caputo (1965) point out 
that the criterion of discontinuance is a relatively heterogeneous 
one in a nursing school setting. Reasons for dropping out range 
from academic failure, to “poor motivation,” marriage and pregnancy.
Thus, dropping out may not be as homogeneous a criterion
of lack of success as placement on the probation-warning list, at
least when selection is made across the entire sample of students.

While the interrelationships among intellectual ability, sociometric
choice and success in an academic training program seem to
be crucial, it appears that for the low ability sample employed in 
this study, social class is of importance as well. Thus, the low abil- 
ity subgroups, while equated for intelligence, were markedly anom- 
alous on the parameter of social class. The low ability dropouts 
were from a lower social class background than the low ability 
stayins. A chi-square analysis of social class in relation to dropout-stayin
status showed a statistically significant difference (chi-square
= 12.40, p < .001). The low ability stayins (mean social
class: 2.10) garnered, as has been shown, a greater number of sociometric
choices than did the low ability dropouts (mean social
class: 4.25). If social class influenced number of choices received,
this effect should be evident across the entire sample of 79 Ss. However,
across the entire sample, number of sociometric choices on all
three choices received was not significantly associated with social
class; t-test comparisons between high, middle and lower classes
did not show significant differences between classes. Thus, social
class per se does not serve to explain the discrepancies in the number
of sociometric choices received by the members of each of the
low ability subgroups. An interaction effect may be involved in this 
instance. It may be, for example, that higher social class status car- 
ries with it certain motivational and behavioral features which lead 
to the approbation of peers and which compensate, in part, for low 
ability in this group. It would seem, on the basis of this finding that 
the differential effect of social class membership in low ability or 
marginal groups merits further study. 

To assess whether maladaptive personality factors, present early
in training, seemed to be found in greater degree among eventual
dropouts, an additional analysis was undertaken. The Edwards
Personal Preference Schedule (1959) was administered in the standard
manner to the entire sample at the same time that the sociometric
was given. From the EPPS, a measure of inflexibility derived
by Hartley and Allen (1962) was computed for the low ability 
dropouts and stayins. The factor-analytically derived measure was 
termed the “conformity” factor by the authors: 

“This constellation identifies the defensive conformist who needs
to be constantly alerted against threatening change and disruption
of self-preserving order and routine in his life activities. There is a
fear of the unexpected and untried. The major coping mechanism
is inflexibility. ...” (p. 157)

The EPPS scales loading highest on the factor, Deference, Order,
Autonomy (negatively), and Endurance, were individually ranked
over the low ability group Ss and an average rank was then obtained
for each S over all four scales. The mean ranks for each of
the four scales were in the direction indicating greater inflexibility 
for the dropouts, as were the mean composite ranks which were re- 
spectively 43.6 and 52.5. However, the U value was 47 which was 
not statistically significant. Since each of the scores was in the in- 
flexibility direction, a minor trend was evident indicating that low 
ability dropouts tend to be more inflexible than stayins of equal 
ability level. It may be this factor of inflexibility or “conformity” to 
which peers respond in making their sociometric selections. 

Peer group evaluation and selection, whether revealed through
sociometric choice or through measures of direct interaction (as in
the study by Barber and Wessen, 1964), appear to have great value,
in studies employing both predictive and concurrent models, for
ordering individuals on an evaluative (“good-bad”) continuum or 
at least a “good-not good” one and ultimately, for selection of indi- 
viduals. It is not clear whether intelligence, mental health, task- 
oriented behavior, or “flexibility” contributes most heavily to the 
positive or negative “stimulus value” of the individual, but all 
seem to be germane.

DEVELOPMENT OF A SCALE OF
SOCIAL PERCEPTION: THE WLP¹


JANET P. MOURSUND 
Michigan State University 


How one person perceives another has been a fruitful area of 
study for personality theorists. Factors such as age (Yarrow and 
Campbell, 1963; Kohn and Fiedler, 1962), culture (Fiedler and 
Hoffman, 1962), status (Fujiwara, 1964, 1965), and subject-object 
similarity (Tomlinson, 1963) have been studied in conjunction
with variations in interpersonal perceptions. Throughout all these 
studies runs one implicit assumption: that interpersonal percep- 
tions are a relatively stable facet of personality. Yet the absence of 
standardized tools for measuring and/or quantifying such percep- 
tions is glaring. A number of articles (Alberoni and Silva, 1962; 
Ashcraft, 1964; Dornbusch et al., 1965) have pointed out, directly 
or indirectly, the need for such an instrument. 

An adequate scale of interpersonal attitudes should be sensitive 
to changes arising out of psychological and social maturational 
processes, should sample a relevant array of relationship areas, and 
should be amenable to quantitative manipulation and statistical 
treatment. Below is described an attempt to develop such a scale.

Description of the scale. The Ways of Looking at People (WLP)
scale consists of 53 item statements, each of which is to be responded
to on a Likert-type scale ranging from ++ (I agree very
much) to -- (I disagree very much). Forty of the items are grouped
into eight subscales of five items each; the remaining 13 items
serve as fillers.

¹ This work was supported by CRP grant #1621 from the U. S. Office of Education,
and was completed in part at the University of Wisconsin Psychiatric
Institute and in part at the Human Learning Research Institute, Michigan
State University.

The eight subscales which are used in evaluating the protocol are
as follows: Similarity to Adults contains relatively straightforward
items relating to the degree to which the respondee feels himself to
be “like” an adult. Giving and Taking, the second category, deals
with a variable which might be characterized as selfishness-unselfishness:
helping others at the expense of one’s self, responsibility
to others, etc. Pity and Blame is concerned with a permissive,
forgiving attitude as opposed to a strict eye-for-an-eye philosophy.
Basic Values is perhaps misleadingly titled; it deals not with the
nature of the respondee’s values, but rather with the importance
which he places upon awareness of these values. A relatively
straightforward subscale is Confidence, which is concerned with the
respondee’s estimation of himself and his capabilities. Liking Others
deals with general sociability and friendliness. Trust and Mistrust
elicits information about the respondee’s opinion of others’
trustworthiness. Finally, Basic Nature asks whether people in general
are “good” or “bad.”

Of the 40 scored items, 22 are worded in a reverse fashion; that is,
high agreement reflects negative rather than positive attitudes. 
Scores may be tallied on a score sheet; a FORTRAN program for 
scoring individual protocols also exists. 
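
The scoring rule just described can be sketched in a few lines. This is a hypothetical illustration rather than the FORTRAN program mentioned above: the numeric codes 1 to 5, the symbols for the three intermediate response options, and the item numbers are all assumptions, since the text gives only the two end points of the response scale.

```python
# Assumed numeric coding of the five response options (not given in the text).
LIKERT_CODES = {"++": 5, "+": 4, "0": 3, "-": 2, "--": 1}

def score_subscale(responses, reversed_items):
    """Sum one five-item subscale, flipping reverse-worded items.

    responses: dict mapping item number -> response symbol.
    reversed_items: set of item numbers worded in the reverse direction.
    """
    total = 0
    for item, symbol in responses.items():
        value = LIKERT_CODES[symbol]
        if item in reversed_items:
            value = 6 - value  # 5 <-> 1, 4 <-> 2, 3 stays 3
        total += value
    return total

# Hypothetical protocol for one subscale; items 2 and 5 are reverse-worded.
example = {1: "++", 2: "--", 3: "+", 4: "0", 5: "-"}
example_score = score_subscale(example, reversed_items={2, 5})
```

With this coding each subscale score can range from 5 to 25, which is at least compatible with the subscale means of roughly 15 to 21 reported in the tables.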

Development of the scale. The first step in constructing the WLP
was a simple listing of areas of thought in which one might expect
social values and attitudes to form. These included such areas as
loyalty to one's friends and personal independence. Within each
area, questions were written which could be answered according to
a 5-point Likert-type scale. The questions were so constructed as to
be thought-provoking, and to have no obvious “right” or “wrong”
answer.


[...] “gooder”) answer was made evident on many items. New items were
[...] administered to a normative sample of approximately 2500 adolescents.
These subjects were currently enrolled in a public senior
high school in Denver, Colorado; a private boys’ secondary school 
in Philadelphia, Pennsylvania; and a Catholic high school in Bos- 
ton, Massachusetts. The normative data were split randomly into 
two parts, balanced for sex, age, year in school, and geographic 
area. From the first half of the data, final selection of “score” test
items and categories was made. The second half was used as a vali- 
dation sample for this final item selection. 

From the first half normative data, item response tallies were 
made on the basis of categorization by age, grade in school, sex, and 
post-high school plans. Items were selected or discarded on the basis 
of discriminating among these various sets of subgroups. 

It was found that age and grade in school provided by far the 
most consistent criteria for item selection. Items which tended to 
satisfy these criteria (choice being, at this point, on the basis of in- 
spection rather than any more rigorous statistical procedure) were 
divided according to value category. Categories which did not have 
a number of discriminating items were discarded.

Within the retained categories, inter-item correlations were run
on all items (both discriminating and non-discriminating). The degree
of correlation with the category matrix served as another
selection criterion.

The final version of the scale looks on the surface exactly like the 
first revision: the non-score items were left as fillers. Items belong- 
ing to a given category are not grouped together on the scale, but 
rather scattered throughout the list. 

Another set of revised WLP data was collected as a part of a 
study of group counseling with high school students (Mathieu and 
Moursund, in press). The final version of the scale was used; the 
48 subjects were volunteers tested at the beginning and at the end 
of group counseling sessions (approximately a six-week interval). 

Discussion. Using the second half of the normative data, differences
among age groups, sexes, socioeconomic levels, and regional
groupings were tested by means of t tests and analyses of variance. 

On four of the eight categories, there were significant score
increases with age. Three other categories showed this trend at a
nonsignificant level (see Table 1).

With regard to sex differences, girls scored significantly higher 
than boys on six of the eight subscales. Boys scored higher on only

TABLE 1
Mean Scores for Three Age Groups on Eight
Subscales of the WLP

                  Age
Scale       15       16       17
S-A**     16.33    15.85    17.91
P/B*      15.20    16.27    16.41
G/T       19.05    18.51    18.61
C         16.88    17.63    17.31
L-O       18.01    18.00    18.80
B-V       19.38    19.71    20.23
T/M**     17.67    17.54    18.79
B-N**     18.42    17.07    19.36

* p < .10.
** p < .05.


one category, “Confidence”; and the difference was not significant 
(see Table 2). 


TABLE 2
Mean Scores for Males and Females on
Eight Subscales of the WLP

                 Sex
Scale       Male     Female
S-A*        16.31    17.08
P/B**       15.26    16.66
G/T**       18.09    19.35
C           17.59    16.96
L-O**       17.39    19.14
B-V**       18.87    20.68
T/M         17.04    18.95
B-N**       17.82    19.15

* p < .10.
** p < .05.


There was a clear trend for the middle category of socioeconomic
level (SEL) rankings to score higher than either the high SEL
group or the low SEL group. The differences were significant on
three categories, and in only one category, Confidence, was the trend
not upheld (see Table 3).

The differences between schools in WLP scores were quite clear 
and consistent for most categories. In six of the eight categories, the 
students from the parochial school in Boston scored highest, and the 
score differences were significant. Only the Confidence category

TABLE 3
Socioeconomic Level Means for Eight Subscales
of the WLP

                     SES Level
Category     Low      Middle    High
S-A         16.89     17.31     16.78
P/B*        15.60     16.77     15.49
G/T*        17.94     19.33     18.26
C           17.63     17.33     16.99
L-O         17.41     18.54     17.81
B-V*        18.27     19.80     19.60
T/M         17.27     18.32     17.76
B-N*        17.60     19.31     18.10

* p < .05.

failed to place the Boston parochial students high; here they scored 
significantly lower than the other groups.

Data obtained from the group counseling subjects offered oppor- 
tunity for both test-retest and inter-item correlations. Test-retest 
reliabilities for the eight categories range from .478 (Confidence) to
.715 (Giving and Taking). More than 80 per cent of the inter-item
correlations ranged between .159 and .692, with a median r of .263. 

Conclusions. The WLP appears to be a relatively reliable instrument.
Its considerable content validity is supported by positive correlation
between test scores and age. This evidence, together with
the relationship between test scores and grade in school, and test
scores and socioeconomic level, indicates that a rather strong construct
validity may be inherent in the instrument.

The WLP does not exist in published form; nor is the test copyrighted.
Copies of the test are available from the author upon request.

PROGRESSIVE MATRICES—SCORES AND TIME


RICHARD B. SCHNELL AND LOUIS DWARSHUIS
Veterans Administration Hospital, Hines, Illinois


The Raven's Progressive Matrices Test (Raven, 1960) has been
used extensively as a test of general nonverbal intellectual func- 
tioning. In most instances it is used as a “capacity” test. The effects 
of the untimed nature of the test do not appear to have been examined.
The present study attempted to investigate partially the
relationship between total Progressive Matrices scores (standard 
form) and the length of time taken to complete the test. It was 
hypothesized that differences in standard scores between the Pro- 
gressive Matrices and another nonverbal test of intellectual func- 
tioning could be related to the use of time in completing the Pro- 
gressive Matrices. 

Subjects. Thirty-one residents of the Hines VA Restoration Cen- 
ter, a rehabilitation center, whose goals are to restore disabled 
veterans to more independent living, were used as subjects. Al- 
though the men in this group had all been discharged from a VA 
hospital (having reached maximum hospital benefits), they still 
displayed a variety of physical disabilities. They appeared to be in 
need of help to make a satisfactory adjustment to the community. 
All men known to have brain damage were excluded. The mean age 
of the group was 41, with a range from 20 to 60. 

Procedures and results. The men were first given the Revised 
Beta (Lindner and Gurvitz, 1957), a short timed test of nonverbal 
intellectual functioning. Then, one to two weeks later, they were 
given the Progressive Matrices with the instructions to take as 
much time as they needed to complete the test. The mean time for
the group was 48 minutes.

Pearson product moment correlation coefficients were computed.
The correlation between the Progressive Matrices and the Beta was
.57. A correlation of .15 was found between Progressive Matrices
scores and the length of time required to complete the test. Between
the time and the difference score (the Progressive Matrices
standard score minus the Beta standard score) a correlation of .64
was obtained. Negligible coefficients resulted from correlations between
the age of the subject and the difference score (.11) and age
and the length of time taken to complete the Progressive Matrices
(.47).
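
The three reported coefficients constrain a fourth that is not reported: the correlation between time on the Progressive Matrices and the Beta score. The sketch below works this out under the assumption (not stated in the text) that both sets of standard scores have equal variance in this sample, so that cov(time, PM - Beta) = r(time, PM) - r(time, Beta) and SD(PM - Beta) = sqrt(2 - 2 r(PM, Beta)) in standard units. The function name is hypothetical, and the PM-Beta correlation is read as .57.

```python
from math import sqrt

def implied_r_time_beta(r_pm_beta, r_time_pm, r_time_diff):
    """Correlation of time with Beta implied by the other three values.

    Assumes PM and Beta are expressed as equal-variance standard scores,
    and that the difference score is PM minus Beta.
    """
    sd_diff = sqrt(2.0 - 2.0 * r_pm_beta)  # SD of (PM - Beta) in z units
    return r_time_pm - r_time_diff * sd_diff

# Values as reported in the text; the result is roughly -.44.
r_time_beta = implied_r_time_beta(r_pm_beta=0.57, r_time_pm=0.15,
                                  r_time_diff=0.64)
```

If the equal-variance assumption holds, the positive correlation between time and the difference score would therefore stem largely from men with lower Beta scores taking longer on the Progressive Matrices, rather than from time being related to the Progressive Matrices score itself.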

Discussion. With intelligence held constant, i.e. through using the 
Beta score as a base score, the difference in the Progressive Ma- 
trices scores and the Beta scores could partially be accounted for on 
the basis of higher scores on the Progressive Matrices being ob- 
tained by taking more time on the test and lower scores being real- 
ized on the Progressive Matrices by taking less time. The age of the 
subject does not appear to play a substantial part in these results. 
When the Progressive Matrices is administered, the use or nonuse of 
time by the subject appears to be a factor requiring consideration. 
Further investigation is indicated to study the variables affecting 
the length of time needed to complete the test. Possible hypotheses 
might be drawn from the work of such researchers as Johnson 
(1953) and Higashimachi (1963) who have found the Progressive 
Matrices scores to be related to personality variables. 

Summary. The effects of the untimed nature of the Raven's Pro- 
gressive Matrices (PM) were investigated by comparing scores on 
it with scores on another nonverbal test of intellectual functioning, 
the timed Revised Beta. Both tests were given to 31 residents of a 
rehabilitation center. A negligible correlation was obtained between 
the PM and the time taken to complete it. Scores computed by
subtracting Beta standard scores from PM standard scores were
found to be positively correlated with the length of time required to
complete the PM. This result suggests that the use of time may be a
variable to consider when using the PM. The research of others implies
that this use of time might be related to personality variables.

A WORD PICTURE CHECK LIST FOR
OFFICER EFFECTIVENESS REPORTS¹


CARL A. LINDSAY² AND JAMES M. McKENDRY
HRB-Singer, Inc.
State College, Pa.


Verbal description portions of military officer effectiveness ratings,
termed “word pictures,” are considered valuable by raters and
ratees alike. Yet, when attempts are made to use such data in determining
some overall index of an officer's performance, difficulties
are encountered because of the lack of techniques for converting 
such data into numerical scores. 

This paper describes one approach to the problem of evolving a
word picture check list to replace the present narrative portion of 
Officer Effectiveness Reports (OER), a logical prerequisite to the 
eventual quantification of such descriptive data. In the present 
study, attention was focused on obtaining a list of items to be in- 
corporated into a word picture checklist through the use of item 
analysis procedures. 

Method—Initial item pool. An initial pool of item stems was ob- 
tained through a review of three potential item sources. First, 
seventy item stems were selected from the report of Hahn and 
Lichtenstein (1962) who classified on a frequency of use basis, 
phrases used in actual word picture sections of completed OER's. 
Second, a sample of 500 OER's was reviewed to determine whether
additional phrases were needed to supplement the first list. The second
source failed to yield any additional items. Third, 30 item stems
were contributed by a project consultant with extensive experience 
in modification of Marine Corps OER’s. 

The total of 100 item stems from these three sources was con- 
verted to check list items in accordance with the following guide- 
lines: (a) all items were short descriptive phrases, (b) items were 
oriented to overt behavioral manifestations, (c) items evaluative of 
the ratee’s personality were kept to a minimum, and (d) items were 
couched in both positive and negative terms. 

The check list questionnaire. The complete check list question- 
naire consisted of four sections: 


Section I: Ratee identification data and coded overall OER
evaluation;
Section II: The check list;
Section III: Rater identification data;
Section IV: Rater’s evaluation of the check list.


In Section II, the check list items were arranged in random order 
with response blocks marked “Yes,” “No,” and “Cannot Say” ar- 
rayed to the left of each item. Items were scored on a three-point 
scale, with “Yes” to a positively phrased item or “No” to a nega- 
tively phrased item receiving a weight of three, and the con- 
verse a weight of one. The “Cannot Say” option received a weight 
of two. 
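
The three-point weighting just described can be written out as a short sketch. The function names and the example responses are hypothetical; only the weights (3, 2, 1) and the treatment of positively and negatively phrased items come from the text.

```python
def item_weight(response, positively_phrased):
    """Weight for one check list item on the three-point scale.

    "Yes" to a positive item or "No" to a negative item scores 3,
    the converse scores 1, and "Cannot Say" scores 2.
    """
    if response not in ("Yes", "No", "Cannot Say"):
        raise ValueError(f"unexpected response: {response!r}")
    if response == "Cannot Say":
        return 2
    favorable = (response == "Yes") == positively_phrased
    return 3 if favorable else 1

def check_list_score(responses, positive_flags):
    """Total score: the sum of item weights over all items."""
    return sum(item_weight(r, p) for r, p in zip(responses, positive_flags))

# Hypothetical four-item example: two positive and two negative items.
example_total = check_list_score(
    ["Yes", "No", "Cannot Say", "Yes"],
    [True, False, True, False],
)
```

Summing the weights in this way is an assumption about how total scores were formed; the text states the item weights but not the aggregation rule.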

The OER check list questionnaire together with the regular OER
was distributed to rating officers by the Air Force through normal
channels. The raters were instructed to complete the regular OER
for a given ratee in a routine fashion, and then, without referring to 
the completed OER, to fill out the OER check list for the same 
ratee. Sections I and III of the OER check list were self- 
explanatory. For Section II, the rater was instructed to indicate the 
descriptive applicability of each item as it pertained to the ratee, 
using the three response blocks. In Section IV, the rater was asked 
to indicate, on a four-choice scale, how strongly he would recom- 
mend (or oppose) the check list as a substitute for the Word 
Picture Section of the OER. 

Data acquisition. There were two phases to the data acquisition: 
an initial tryout based on the initial 100 items, and a second tryout
based on 56 items selected from the initial tryout. Response weights
were correlated with two criteria in the initial tryout phase: (a) the
ratee’s overall OER score (on a nine-point scale) for that particular
rating period, r_o; and (b) the ratee’s average OER score for all his
previous ratings, r_av. Items for the second tryout were selected in
accordance with the following criteria: (a) item popularity, based
on the proportion of the ratees who received a favorable endorsement,
should fall between .30 and .80; (b) item correlations with both
criteria were to be maximized; and (c) nonstatistical considerations
such as coverage of performance areas. For the second tryout,
item responses were correlated with r_o only.
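
A minimal sketch of the statistical part of this screening follows. It covers only criterion (a) and a simple ranking for criterion (b); the nonstatistical considerations in (c) cannot be coded. The function names, the data layout, and the reading of "favorable endorsement" as a response weight of three are assumptions made for illustration.

```python
from math import sqrt

def pearson(x, y):
    """Pearson product moment correlation of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # a constant variable carries no correlation
    return sxy / sqrt(sxx * syy)

def screen_items(weights, criterion, low=0.30, high=0.80):
    """Keep items whose popularity lies in [low, high], ranked by r.

    weights: dict mapping item name -> list of 1/2/3 weights per ratee.
    criterion: list of criterion scores (e.g. current OER rating).
    """
    kept = []
    for name, w in weights.items():
        popularity = sum(1 for v in w if v == 3) / len(w)
        if low <= popularity <= high:
            kept.append((name, popularity, pearson(w, criterion)))
    return sorted(kept, key=lambda row: row[2], reverse=True)

# Hypothetical data for six ratees and three items.
criterion = [9, 8, 7, 4, 3, 2]
weights = {
    "A": [3, 3, 3, 1, 1, 1],   # popularity .50, strongly related
    "B": [3, 3, 3, 3, 3, 3],   # popularity 1.00, screened out
    "C": [3, 1, 3, 1, 3, 1],   # popularity .50, weakly related
}
selected = screen_items(weights, criterion)
```

In this toy example item B is dropped because every ratee is endorsed favorably, and items A and C are retained in order of their correlation with the criterion.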

Subjects. In the initial phase, the check list was completed by 
raters evaluating company grade officers, i.e., lieutenants and cap- 
tains. Of the 1500 check lists distributed, 1304 were returned in the 
available time period, of which 1094 were properly completed. 
However, complete criterion data, viz., for each ratee at least two
other OER’s completed as well as a current OER, were available
for only 793 ratees. All item statistics were computed on the
basis of the item responses for these 793 ratees without regard to
rank.

In the second tryout phase, a check list of 56 items selected from
the original 100 items was developed. Check lists were sent to 1500
raters in the same manner as for phase I, and 1189 acceptable questionnaires
were received. The acceptable questionnaires were divided
into three groups on the basis of the ratee’s rank: second lieutenants
(N = 229), first lieutenants (N = 428), and captains (N =
532). Item statistics were computed separately for each group
using the ratee’s current overall rating as the criterion.

Check list statistics. As indicated, responses to the check list
items were correlated with two criteria, current OER rating (r_o)
and average OER rating (r_av), which excluded r_o, for the initial tryout,
and with r_o only for the second tryout. For the initial tryout
the correlations of total scores on the check list with r_o and r_av were
calculated, and with r_o only for the second tryout. An indication of
the internal consistency reliability of the check list was developed
by correlating a sample of 100 ratees’ odd and even scores, based on
a final set of 45 items, for both phases. And finally, the per cent of
the raters who endorsed each of the four alternatives for questionnaire
acceptability in Section IV was calculated.
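
The odd-even reliability estimate can be sketched as below. The sketch assumes that the "corrected" split-half coefficients reported for the check list were obtained with the Spearman-Brown formula, which the text does not name, and the score matrix is hypothetical. With 45 items the two halves are of unequal length (23 and 22 items), a point the standard formula ignores.

```python
from math import sqrt

def pearson(x, y):
    """Pearson product moment correlation of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def spearman_brown(r_half):
    """Step a half-test correlation up to full test length."""
    return 2.0 * r_half / (1.0 + r_half)

def odd_even_reliability(score_rows):
    """Corrected split-half reliability from per-ratee item weights."""
    odd = [sum(row[0::2]) for row in score_rows]    # items 1, 3, 5, ...
    even = [sum(row[1::2]) for row in score_rows]   # items 2, 4, 6, ...
    return spearman_brown(pearson(odd, even))

# Hypothetical weights for five ratees on four items.
rows = [[3, 3, 3, 3], [3, 2, 3, 3], [2, 2, 1, 2], [1, 1, 1, 1], [3, 1, 2, 2]]
reliability = odd_even_reliability(rows)
```

Read in reverse, the same formula implies that a corrected coefficient of .970 corresponds to an odd-even correlation of about .94 before correction.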
TABLE 1
Individual Item Statistics

Initial Tryoutᵃ                                                 Second Tryout
                                                          2nd Lt.ᵇ  1st Lt.ᶜ  Capt.ᵈ
  r_o   r_av   Item                                          r_o      r_o      r_o
  .53   .33     1. Dependable officer                        .43      .34      .39
  .89   .24     2. Plans and recommendations sound           .29      .28      .33
  .89   .25     3. Exceptionally thorough, etc.              .36      [?]      .34
  .43   .24     4. Particularly effective in manag.          .36      .28      .28
  .81   .18     5. Requirements on time                      .24      .15      .19
  .26   .10     6. Accepts additional duty                   .03      .15      .18
  .38   .21     7. Considers problems clearly                .26      .22      .85
  .82   .21     8. Lacks drive to be career officer          .25      .24      .18
  .22   .14     9. Lacks interest in present field           .21      [?]      [?]
  .81   .18    10. Competence below average                  .21      .19      .24
  .18   .12    11. Voluntarily improving education           .04      .04      .07
  .43   .27    12. Applies knowledge of specialty            .25      .20      .85
  .27   .12    13. Reluctant to make decisions               .23      .19      .18
  .28   [?]    14. Shows consistent improvement              .16      .14      .19
  .82   .18    15. Motivated by strong interest              .20      .11      .22
  .42   .28    16. Displays unusual drive                    .34      .25      .38
  .42   .26    17. Unusually versatile                       .42      .24      .82
  .24   .15    18. Well informed                             .21      .24      .18
  .46   .28    19. Excellent potential                       .97      .30      .37
  .35   .14    20. Improvises imaginatively                  .35      .25      .27
  .39   .25    21. Attitude thoroughly professional          .28      .24      .25
  [?]   .27    22. Well qualified in all aspects             .36      .32      .29
  [?]   .27    23. Ready for reassignment                    .36      .24      .84
  .22   .13    24. Well qualified for advanced tr.           .24      [?]      .19
  .29   .15    25. Duty affected by personal problems        .19      .26      .19
  .24   .12    26. Sensitive to criticism                    .10      .08      .22
  .34   .19    27. Tends to react impulsively                .18      .18      .21
 -.08  -.16    28. Weak first impression                    -.02     -.02     -.07
  [?]   .27    29. Ability to impart knowledge               .28      .29      .81
  .37   .26    30. Strong person; self-confident             .82      .18      .31
  .27   .16    31. Unusually high standards                  .13      .25      .18
  .48   .30    32. Suited for command                        .33      .80      .84
  .88   .04    33. Poised, articulate                        [?]      [?]      .27
  .45   .27    34. Noticeably more mature                    .34      .28      .28
  [?]   .24    35. Unusual high intelligence                 .87      .25      .31
  .43   .24    36. Inspires immediate confidence             .29      .82      .29
  .31   .17    37. Interested only in specialty              .18      .24      .16
  .31   .16    38. Readily devotes extra time                .18      [?]      .21
  .88   .16    39. Determined to get job done                .24      .24      .22
  .38   .22    40. More than average supervision             .22      .14      .30
  .30   .15    41. Tends to “take it easy”                   .24      .16      .17
  .40   .23    42. Reacts decisively                         .32      .21      .25
  .43   .24    43. Quick to understand                       .26      .21      .32
  .22   .08    44. Takes interest in subordinates            .21      .16      [?]
  .24   .14    45. Highly effective training                 .15      .25      .21
  .40   .22    46. Directly responsible for improv.          .30      .32      .20
  .27   .13    47. Works effectively with other org.         .20      [?]      .15
  .35   .17    48. Records, reports clear, concise           .14      .12      .29
  .30   .17    49. Insures proper execution                  .23      .25      .22
  .20   .12    50. Gives credit to subordinates              .20      .15      .08
  .35   .15    51. Solutions original, imaginative           .22      .24      .28
  .22   .19    52. Instructions often ambiguous              .23      .04      [?]
  .36   .19    53. Demands & receives high standards         .35      .22      .11
  .46   .28    54. Requires little or no supervision         .29      .29      .21
  .36   .18    55. Excellent executive                       .91      .30      .34
  .56   .34    56. Demonstrates competence                   .46      .38      .98

ᵃ N = 793 (total group).
ᵇ N = 229.
ᶜ N = 428.
ᵈ N = 532.


Results and discussion—Item statistics. Table 1 provides the 56
individual items and their statistics selected from the first and used
in the second tryout phase. For the initial tryout the correlation of
the item with r_o and r_av is shown without regard to rank. For the
second tryout, the item correlation with r_o is shown by grade of
ratee. For the initial tryout, the distribution of item-criterion
correlations indicated that the items correlated more highly with r_o
than with r_av. In fact, 34 items correlated .30 or higher with r_o
while only two items correlated that highly with r_av. Possible
reasons for this finding are: (a) the relatively low (.538) correlation
between r_o and r_av could be based on time variations in a rater's
perception of a ratee; the check list was completed by the rating
officer at the same time the ratee was given an overall rating. Thus
the possibility of a “halo” effect was enhanced which would inflate
r_o values; and (b) r_av values contain the effects of differences
between different raters’ perceptions of the same ratee.

It is also evident that item correlations with r_o were lower for the
second tryout phase than for the first. For the second phase, only 
about 26 per cent of the correlations were above .30, while about 
44 per cent were above .30 in the initial tryout. This shrinkage may 
be due to the fact that since correlations were computed separately 
for officers of different rank for the second tryout phase, the variability
of the criterion was reduced. It was demonstrated that there
is an increase in the negative skew of the OER overall rating as the
rank of the ratee increases (McKendry and Lindsay, 1964). 

Check list reliabilities and total score validities. Preliminary in- 
vestigations were made concerning the reliability and validity of 
the check list of 45 final items selected (items 6, 9, 26, 27, 28, 37, 41, 
45, 50, 52, 55 shown in Table 1 were omitted). Results of these 
analyses, which are shown in Table 2, indicate that the check list 
seems to be feasible and is probably metrically adequate. It should 
be kept in mind, however, that the item selection process may well 
have led to inflated correlations in these data; independent replica- 
tion is still needed. 


TABLE 2

Preliminary Tests of the Check List’s Reliability and Validity
(N = 100 for all Correlations)

                  Corrected Split-half        Validity Correlations
Samples           Reliability Correlations    Current OER   Average OER

Initial Tryout            .970                   .615          .414
Second Tryout             .949                   .417         no data
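
As a rough illustration of how a corrected split-half coefficient of the kind reported in Table 2 is obtained, the sketch below applies the Spearman-Brown formula. The report does not name the correction, so it is assumed here to be Spearman-Brown with a doubling of test length. The function names are hypothetical, and the back-computed half-test correlations are illustrative only, not values given in the report.

```python
def spearman_brown(r_half: float, length_factor: float = 2.0) -> float:
    """Step up a part-test correlation to the reliability of the lengthened test."""
    return length_factor * r_half / (1.0 + (length_factor - 1.0) * r_half)


def implied_half_correlation(r_full: float, length_factor: float = 2.0) -> float:
    """Invert the formula: the part-test correlation implied by a corrected value."""
    return r_full / (length_factor - (length_factor - 1.0) * r_full)


# Corrected coefficients as printed in Table 2; the uncorrected half-test
# correlations are NOT reported and are only inferred under the assumption above.
for label, corrected in [("Initial Tryout", 0.970), ("Second Tryout", 0.949)]:
    r_half = implied_half_correlation(corrected)
    print(f"{label}: corrected {corrected:.3f}, implied half-test r {r_half:.3f}")
```

Under that assumption, both coefficients would correspond to half-test correlations above .90, which is in keeping with the caution in the text that item selection may have inflated these values.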


Check list acceptance. The acceptance of the check list as a 
whole by rating officers was satisfactory. Seventy per cent of the 
initial tryout rating officers and 63 per cent of the second tryout 
rating officers recommended its adoption. A number of officers also 
made comments about the check list. For those favorable towards 
the check list, comments most frequently written were: check list is 
easier, more objective, does away with differences in writing ability, 
less time consuming. For those against the check list, comments in- 
cluded the following: not enough categories, ambiguous, doesn’t 
cover areas of responsibility, need at least 100 items, still need 
Word Picture section, too cut-and-dried, statements are canned, 
compares officer to an ideal. Several officers suggested that negative 
items should be eliminated and that OER be completed only on 
sub-standard officers. 

Conclusions and recommendations. The scope of the investiga- 
tion was limited to two questions. (a) Is it feasible to substitute a 
check list for the Word Picture section of AF OER’s? (b) If such a 
substitute is made, how would the users (raters) receive it? Results 
indicate that the check list is feasible and that it would be accepted
by rating officers, although an indoctrination period might be neces- 
sary before large scale implementation. 

However, important qualifications must be made to these con- 
clusions. The criterion for the item analysis was an internal one, 
with a relatively low degree of stability over time as judged from 
the correlation of r_c and r_av (.538). In addition, no separate test-
retest estimate of the reliability of the final check list was calcu- 
lated. So, although a sizable number of items correlated satisfacto- 
rily with the criterion on two independent occasions, the stability of 
these results over time remains uncertain. Second, the dimensional- 
ity of the check list was unexplored. It is probable that some of the 
final items selected are redundant, and a saving of both time of 
rating officers and space on OER's may be effected. The problem of 
optimally weighting or scaling the items should also be investigated. 
Remarks on Jackson’s “Review” of Block’s Challenge of Response
Sets. 


It is the responsibility of a book reviewer to form and to con- 
vey a balanced judgment of the work he is assigned to review. In 
order to do so, dispassion rather than passion should guide his 
thinking and his remarks. If his impartiality is suspect, either to 
himself or to his prospective readers, then a reviewer should
disqualify himself from the task at the outset in order to devote
himself more frankly to the expression of views or arguments not
properly placed within a review.

I believe that Douglas Jackson should not have let himself “review”
my book, The Challenge of Response Sets. The focus of my
work, the twin themes of acquiescence and social desirability in 
personality inventories, centrally concerns his own work in a way 
that was compelled by my findings to be critical. He is entirely
welcome to counter my arguments, dispute my facts, and to ad- 
vance new definitions and data as he has done. However, his re- 
marks are better labelled as a “critique” than as a “review.” I can 
accept, albeit perhaps ruefully, an unfavorable evaluation by a 
reviewer when the usual academic ground rules are employed. But 
I see Jackson’s “review” as a further and excessively propense 
expression in a continuing dialogue on the substantiality of ac- 
quiescence and social desirability as explanatory constructs in per- 
sonality assessment. Accordingly, I am moved to respond, if only 
briefly.

There are many aspects and claims of the Jackson “review”
which require clarification or confrontation but it would be fatiguing 
and simply not worth the effort to pursue them all. Instead, I 
choose to consider, as a suggestive example, the very first “error” 
Jackson finds in my book. The general nature of Jackson’s later 
charges of error may be better evaluated after this counter-coun- 
ter-counter-analysis. The implication of Jackson’s attempt to shift, 
at this late date, the operational definitions of acquiescence and so- 
cial desirability is then drawn, a consequence not spelled out by 
him. Finally, our general agreements and disagreements are noted.
With respect to the full panoply of defense and depreciation that 
Jackson has constructed, the reader really has no recourse but to 
read both book and "review" if he is to be able to evaluate fully
either. If he does so, I am most content to let my book and the 
Jackson "review" each stand on their own merits.

Item overlap in the analysis of “true” MMPI subscales. Jackson
is disturbed by my demonstration that item overlap created pervasive
and not small correlations among the “true” MMPI subscales and,
separately, among the “false” MMPI subscales but could not impose
correlations between “true” subscales on the one hand and “false”
subscales on the other. As a consequence of these entailed
relationships, a factor analysis of the overlap-produced correlations
necessarily issues a factor structure which perfectly separates
“true” and “false” subscales, a separation that Jackson and Messick
earlier had emphasized as substantive evidence for acquiescence.
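
The constraint described above can be sketched with the classical common-elements formula, under which two scales built from uncorrelated, equal-variance items correlate n_ab / sqrt(n_a * n_b), where n_ab is the number of items the two scales share. The item numbers and subscale labels below are hypothetical and are not taken from the MMPI keys. The assumption of independent, equal-variance items is a simplification made only for this illustration.

```python
from math import sqrt


def common_elements_r(items_a: set, items_b: set) -> float:
    """Correlation induced purely by shared items (independent, equal-variance items)."""
    shared = len(items_a & items_b)
    return shared / sqrt(len(items_a) * len(items_b))


# Hypothetical subscales: items are shared within the "true"-keyed set and within
# the "false"-keyed set, but no item belongs to both a "true" and a "false" subscale.
subscales = {
    "A_true": {1, 2, 3, 4},
    "B_true": {3, 4, 5, 6},
    "A_false": {7, 8, 9, 10},
    "B_false": {9, 10, 11, 12},
}

names = list(subscales)
for i, a in enumerate(names):
    for b in names[i + 1:]:
        r = common_elements_r(subscales[a], subscales[b])
        print(f"{a} vs {b}: {r:.2f}")
```

In this toy arrangement the only non-zero overlap correlations lie within the “true” set and within the “false” set, which is the block pattern at issue in the argument.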

Jackson’s response to this finding that a logical constraint had 
been interpreted as an empirical result takes two forms: (1) The 
problem was recognized all along. “Jackson and Messick in con- 
structing their special response style marker scales systematically 
minimized item overlap with MMPI clinical scales”; (2) Any effects
of item overlap had been eliminated. “(Jackson and Messick)
sought to render the effects of item overlap harmless by rotating 
item overlap factors to a position orthogonal to response style and 
to content factors. . . ." 

The first of these answers is incorrect and introduces an
irrelevancy. The motivated reader is invited to inform himself directly
regarding the validity of the following categorical statement: I
assert, contrary to Jackson’s claim of sophistication regarding the
sert, contrary to Jackson’s claim of sophistication regarding the 
point at issue, that nowhere in the course of their two papers 
analyzing the “true” and “false” MMPI subscales do Jackson and 
Messick recognize the special way in which item overlap operates 
once MMPI scales are separated into their "true"-keyed and 
"false"-keyed components. Apparently, this recognition is still not 
enjoyed since Jackson in his “review” mentions minimizing item 
overlap in so-called response style marker scales (themselves sub- 
ject to criticism on other grounds) as a relevant control on this 
problem. These marker scales have simply nothing to say about the 
effects of item overlap within the sets of “true” and of “false” 
MMPI subscales. 

The second answer Jackson employs ascribes truly magical 
power to rotational procedures, i.e., the ability to separate two
collinear dimensions. He asserts that his argued-for acquiescence 
factor, which perfectly separates “true” and “false” MMPI sub- 
scales, can be cleanly distinguished via rotation from the indub- 
itably-present artifact factor which also perfectly separates the 
“true” and the “false” MMPI subscales. This assertion betrays a
misconstruction of factor analysis and the logic of rotation. The 
collinear, confounding source of artifactual variance introduced by
the special overlap component cannot be whirled away by ac- 
quiescence-serving rotations; rather, its presence at the outset must 
be excluded if an unequivocal interpretation is desired. There is no 
other way. 

Redefining the indicators of response sets. One of the problems 
in making psychology a cumulative science is that controversies 
may go on for years simply because different investigators have 
operationalized what is purportedly the same concept in funda- 
mentally different ways. To avoid such straw possibilities, it was 
my explicit intention in my own analyses into acquiescence and 
social desirability to control these variables in ways that derived 
without question from the logic earlier employed by the protagonists
of these concepts. I view this feature of my research orientation
as a very strong one for it prevents the usual sidestepping and 
diffusing reaction to contrary results, namely, “but we are meas- 
uring different variables.” 

Thus, I reasoned that if, as Messick and Jackson argue in their 
influential 1961 article, imbalance in the number of true and false 
keyed items in a scale is evidence for the presence of acquiescence, 
then it follows that a balance in the number of true and of false 
items is a control on acquiescence. Similarly, I reasoned that if the 
keying of items in accord with their social desirability scale values 
had been employed as the sole and sufficient basis for construing 
these items in desirability terms, then it follows that a scale
whose items could not be related to social desirability scale values
could not be so interpreted. 

In his “review,” Jackson now argues against the methods I em- 
ployed to control acquiescence and social desirability. What he 
does not realize is that an argument against balancing scales with 
respect to the number of true and false items is simultaneously an 
argument against his (and Messick’s) prior use of scale imbalance 
as an indicator of acquiescence. An argument against the use of the 
Messick and Jackson social desirability scale values as a sufficient
basis for excluding the contribution of desirability is
simultaneously an argument against the many uses of these identical
scale values to support the contention of a social desirability ef- 
fect. Jackson cannot have it both ways—his arguments against 
the controls I employed also and necessarily sweep away the kinds 
of evidence that had been used to advance the response set position. 
I am not sure he fully realizes the retroactive implications of his
current stance. 

Where are we now in this protracted controversy? We may be
closer to an effective agreement than the tenor of Jackson’s “review”
suggests. Thus, both Jackson and I agree there is no evidence
for acquiescence when scales balanced for true and for false
items are analyzed. We differ in that I believe balancing the number
of true and false items is a proper and sufficient control for
acquiescence while Jackson has changed his mind, suggesting now 
that such balanced scales “might still yield scores influenced by 
acquiescence.” This tentative conjecture by Jackson rests prima- 
rily on a new and as yet untried approach to indexing acquiescence 
in terms of "average true score variance." We will have to await
the empirical results Jackson generates by this new operational 
translation of acquiescence to see whether, in these new terms, 
acquiescence finally becomes a viable explanatory concept with, 
finally, a creditable and unconfounded basis in measurement. 

Interestingly, and perhaps pre-figuring an ultimate withering 
away of the current controversy, when presenting “A Modern 
Strategy for Personality Assessment: The Personality Research 
Form” in a 1966 address before the American Psychological Associa- 
tion, Jackson no longer chooses even to mention the concept of 
acquiescence although the response styles of desirability and in- 
frequency continue to attract his attention. 

With respect to the concept of social desirability, Jackson and I 
again agree that the first dimension of the MMPI has important 
and many personality correlates in the larger world. We differ in 
that I believe this finding has appreciable significance since the 
social desirability construct as formulated by Edwards explicitly 
denies the characterological significance of the first MMPI factor; 
Jackson asserts the social desirability formulation need not be 
embarrassed by the behavioral correlates of the factor dimension 
it proposes to explain. 

With respect to the MMPI, Jackson and I agree that this in- 
ventory is a generation old and can be improved upon. We differ 
in that I believe the MMPI was, for its time, a significant con- 
tribution to the field of personality assessment while Jackson 
grants the test little if any respect.

Our agreements are reassuring; our disagreements are perhaps 
inevitable. Let our peers decide on the matters that still separate us. 
In the meanwhile, I suggest that Jackson and I can now make more 
effective use of our energies by directing them toward more vital 
psychological issues. 

JACK BLOCK
University of California, 
Berkeley 


Balanced Scales, Item Overlap, and the Stables of Augeas. 

Jack Block has requested the privilege of objecting to my review 
of his book. He feels that I am biased and that my review is 
“excessively propense.” He would have the reader believe that the 
many errors in his logic and method documented in my review are 
somehow to be evaluated in terms of the person presenting them,
rather than in terms of their cogency and accuracy. But, fortunately, 
there is a body of knowledge in psychometric theory and practice 
to which objective reference can be made. While it is perhaps 
understandable that Block finds my review difficult to accept, many 
of the criticisms presented are not simply matters of opinion. They 
are not of the sort which would evaporate before the eyes of even 
the most sympathetic reviewer. Block would prefer that those 
whose research he has considered be excluded from the list of 
legitimate reviewers. But, in so doing he might exclude those most 
interested and most knowledgeable about the issues. If my review 
appears long, it was only because I wished to provide the reader with 
both a fair summary of Block’s position, and sufficient documenta- 
tion to provide a basis for following a line of reasoning and criticism, 
rather than presenting only my conclusions. Since Block’s reply 
to my review introduces questions (and misconceptions) about the 
nature of item overlap factors on the MMPI, and about definitions 
of response styles, some further clarifying remarks are in order. 

The abuses of factor analysis: an illustrative example. Block in 
his “remarks” has chosen a single issue in my review as a “suggestive
example.” He advises the reader that this issue will provide a
basis against which “Jackson’s later charges of error may be better 
evaluated . . .” In my review I indicated that Block's analysis of 
item overlap as an explanation of the acquiescence factor was in 
error, and that he failed to report that carefully devised controls for 
overlap had been introduced by Jackson and Messick. Since it is 
Block’s stated wish to evaluate my review in terms of this issue, I 
will discuss his treatment of item overlap in greater detail than was 
originally undertaken in the review. Hopefully, this discussion will 
also serve to clarify the broader issues involved and to allay some 
of the misconceptions which might result from a too casual reading 
of Block’s book and “remarks.” 

Block suggests that Jackson is “disturbed” by Block’s analysis of 
MMPI item overlap. It was indeed disturbing to find so many errors 
in Block’s analyses, particularly because this was a putatively 
psychometric effort with pretensions at rigor. Block factor analyzed 
“common elements" correlations between true and false keyed 
MMPI clinical scales. He extracted only two factors with the 
centroid method. The first unrotated centroid separated true and 
false keyed MMPI scales, leading Block to the conclusion that the 
Jackson and Messick report of an acquiescence factor was “subject 
to an alternative explanation.” There are five reasons why the 
reader should be very cautious about taking Block’s claim seriously: 
(1) Block committed a serious error in regard to the nature of 
factor analysis of item overlap factors. “Two factors,” Block (p. 16) 
states, “of course, extract all variance in this totally constrained 
situation . . .” Of course, this statement is incorrect and demon-
strably absurd. The only a priori constraint on the number of pos- 
sible factors based on either correlations of common elements or of 
psychological traits is one less than the order of the correlation 
matrix, assuming communalities approaching unity. Thus, multiplication
of Block’s factor matrix by its transpose does not adequately
reproduce the original correlations. Anderson and Bashaw (1966) 
have demonstrated empirically that two, three, four, and five factor 
solutions of MMPI overlap factors are possible. Jackson and Mes- 
sick interpreted four overlap factors. The reviewer factored Block’s 
matrix by principal axes and found that more than twelve factors 
showed positive eigenroots. (2) Arithmetic errors and the use of the 
anachronistic centroid method fail to inspire confidence. Block re- 
ports (p. 16) a mathematically impossible communality of .61 for 
НБ, when the two factor loadings are .162 and .186. Block’s reported 
factor loadings are substantially at variance with those derived from 
a principal axes solution. (3) Block failed to take into account the 
relative magnitudes of the acquiescence and item overlap factors. 
In Block’s analysis, his unrotated first centroid factor accounts for 
but eight per cent of the total variance. Jackson and Messick’s 
(1961) acquiescence factor derived from the prison sample ac- 
counted for 31.4 per cent of the total variance of their correlation 
matrix. Block’s “alternative explanation” would appear to leave
about three quarters of the variance previously identified as
acquiescence unexplained. A similar conclusion in regard to the effect of
item overlap factors on MMPI structure has recently been presented 
by Anderson and Bashaw (1966). (4) The failure to rotate axes 
resulted in a misinterpretation of the nature of the item overlap 
factors and their relation to those identified by Jackson and Mes- 
sick. Had Block extracted a sufficient number of factors, and had he 
rotated these according to customary factor analytic practice, he 
would have identified about the same overlap factors as did Jackson 
and Messick. Over thirty years ago, Thurstone demonstrated the
need to rotate axes in factor analysis to enhance interpretability.
Block has chosen to turn his back on thirty years of factor analytic 
experience and on the availability of objective, analytic rotational 
procedures. In so doing he has compounded error. The first unrotated 
centroid of bipolar personality scales usually results in a factor 
with positive and negative loadings. But in the present instance, 
when true and false keyed MMPI subscales share very few or no 
items in common, to interpret them as falling on the same factor is 
not reasonable. Separate orthogonal factors of true and of false keyed 
items would appear to be a more plausible alternative. An exami- 
nation of the factors emerging from the reviewer's principal axes 
analysis of the Block common element correlations confirms this 
and shows that these factors match those of Jackson and Messick 
very well. Five factors were rotated to a normalized varimax
criterion. The following are the salients for the four largest factors 
from this analysis, followed by the positive salients for their coun- 
terpart factors among MMPI clinical scales identified as due to item 
overlap in the Jackson and Messick analysis. 


Factor I: 8с, Раг; Ра, Sc; 

Factor П: Из, Нуу; Ну, Hs; 

Factor III: Hy; Юу, Hs; Pti; Ну, Нз, Dt, Рі, 
Factor IV: Ра, Ма, Нуу; Ма, Pd;, Scy. 


This is a remarkably close fit, especially considering the fact that
the Jackson and Messick studies contained many scales omitted 
from Block’s analysis. Furthermore, overlap factors from the present 
analysis and from that of Jackson and Messick account for similar 
proportions of the variance. Perhaps, finally, when presented in this 
way, Block will be unable to ignore the earlier Jackson and Messick 
findings on overlap, as he did in his book. (5) Item overlap factors 
do not account for major acquiescence factor loadings. Block 
charges that the reviewer has introduced an “irrelevancy” by pointing
out that the special response style marker scales were constructed
in such a way as to limit systematically item overlap with
clinical scales. If this is an “irrelevancy,” and if the factor previously 
identified as acquiescence by Jackson and Messick is due to item 
overlap as Block insists, then how does Block account for the load- 
ings of the five all true keyed Dy scales of the order of .36, .55, .88, 
.78, and .53, replicated across three samples? Surely, loadings of 
this magnitude cannot be dismissed as mere irrelevancy. 
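
Two of the numerical points in the foregoing argument can be checked directly. In an orthogonal factor solution a variable's communality is the sum of its squared loadings, and the share of the earlier acquiescence variance left unexplained follows from the two percentages quoted. The sketch below uses only figures given in the text. The function and variable names are illustrative, and the comparison of percentages drawn from two different correlation matrices is taken at face value, as it is in the argument itself.

```python
def communality(loadings):
    """Communality of one variable: sum of squared loadings on orthogonal factors."""
    return sum(loading * loading for loading in loadings)


# Point (2): the two loadings quoted from Block (p. 16).
h2 = communality([0.162, 0.186])

# Point (3): per cent of total variance attributed to each factor in the text.
overlap_pct = 8.0          # Block's unrotated first centroid
acquiescence_pct = 31.4    # Jackson and Messick (1961), prison sample
unexplained_share = 1.0 - overlap_pct / acquiescence_pct

print(f"communality from the quoted loadings: {h2:.5f}")
print(f"share of acquiescence variance left unexplained: {unexplained_share:.3f}")
```

The quoted loadings give a communality of about .061 rather than .61, and the ratio of 8 to 31.4 leaves roughly three quarters of the variance unaccounted for, in line with the figures stated above.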

On the definition of response styles. Block accuses Jackson of 
changing his mind with respect to balancing the number of true and 
false items as a proper and sufficient control for acquiescence. He
does not (and cannot) defend the proposition that his true and false 
keyed subscales, with gross differences in variance and reliability, 
can be considered balanced in any sense. Rather, he blames Messick 
and Jackson for leading him to consider items and scales equal in 
acquiescence-eliciting potential. Fortunately, it is possible to treat 
this as a question of fact and resolve the matter by referring to the 
original article. After reviewing relevant research, Messick and 
Jackson (1961, p. 302) point out that high correlations have “been 
reported between MMPI scales and desirability measures having a
balanced number of true and false items . . .” Messick and Jackson 
continue, “In an attempt to take these findings into account, it is 
suggested that the acquiescence-evoking properties of items are not, 
as assumed above, uniform over all scales, but that acquiescence is 
elicited differentially as a function, perhaps, of specific item content, 
of the clarity or ambiguity with which the content is stated, and in 
particular, of the perceived desirability of the statement.” This 
statement is unequivocal enough. Block's use of the Messick and
Jackson position as a justification for "balancing" true and false 
subscales for acquiescence can hardly be taken seriously. Indeed,
had he heeded the quoted statement, he might have avoided certain 
of the errors outlined in my review. Incidentally, Block's statement 
in his “remarks” that “. . . Jackson and I agree there is no evidence 
for acquiescence when scales balanced for true and for false items 
were analyzed,” is not a true statement. I can only express wonder 
at how Block could possibly draw such an inference from my 
review. The interested reader may consult detailed statements 
of my views on this issue (Jackson, 1967; Messick and Jackson, 
1961a).

Block does not challenge the findings reported in my review that 
when his "desirability-free" scale is scaled for the desirability of a
false response, it is manifestly not desirability free. Rather, he 
ascribes his error to the Messick and Jackson (1961b) desirability 
scale values. A perusal of the article reporting these scale values, 
however, reveals that a major portion of the introduction and dis- 
cussion emphasized the multidimensional nature of desirability, and 
the fact that the values reported were average values. It has been 
known since 1959 (Wiggins and Rumrill, 1959) that ratings of the 
desirability of a true and of a false response were not entirely sym- 
metrical. This should have caused Block to recognize that his 
“desirability-free” predominantly negatively-worded items might
not be desirability free when scaled from the other direction. 
No one has changed the definition of desirability, nor do Block’s 
findings for a handful of items change the basic conclusions re- 
garding average desirability and the bulk of MMPI items and 
scales.

Thus, Block, by raising a “suggestive example,” and a charge that
I have redefined acquiescence and desirability, has encouraged 
further scrutiny of the adequacy of his logic and methods, and the 
accuracy of his citations. Both have been found wanting. Added to 
the mixed assortment of Block’s errors and confusions outlined in 
my review, of conveniently reporting in a crucial analysis only three 
factors when at least ten were interpretable, and of harangues about
“clinical lore,” we now find certain elementary errors in Block’s 
understanding and execution of a factor problem, and further 
erroneous statements regarding the position of others. 

There is a Greek myth of the stable of King Augeas of Elis. There 
were housed three thousand oxen in this stable which remained un- 
cleaned for thirty years. The MMPI has been with us for almost 
thirty years. Its authors had high hopes for its usefulness in
differential diagnosis, a hope as yet largely unrealized. Many dedicated
research workers and professionals have expanded our knowledge 
of assessment with this device. But the seemingly endless prolifera- 
tion in the name of science of empirically-derived scales measuring
largely only two dimensions has involved much effort which might 
better now be re-directed. While the number of such scales has 
probably not yet reached three thousand, the stable (which now 
includes Block’s modest contribution) is becoming crowded. It is 
time to consider the task of cleaning. Or, because no Hercules has 
appeared in our midst, we had better seek fresher pastures and more 
modern quarters for a somewhat more carefully selected herd of 
oxen. Had Block laid the foundation for such a search with a fresh 
approach to personality assessment, I could have in honesty hailed 
it as attacking an important problem. Instead, we find only a 
further routine application of empirical item selection to produce 
yet another set of “scales” almost totally lacking in discriminant
validity. In the future, rather than further overpopulate a barn
already acknowledged by all to be inadequate, let us aspire to blaze 
new pathways. 
Testing Problems in Perspective: Twenty-fifth Anniversary Volume
of Topical Readings from the Invitational Conference on Test- 
ing Problems by Anne Anastasi, Editor. Washington, D. C.: 
American Council on Education, 1966. Pp. xiii + 671. $10.00. 

From time to time publications are made in every field that its 
adherents will purchase for prestige or for sentimental reasons,
irrespective of content. The scholar in the disciplinary area in which
these books appear will want to own a copy to demonstrate his 
being "in," to display his loyalty to his profession. Of course, many 
people will genuinely want to read such books—newcomers to the 
area out of interest in their predecessors’ viewpoints, established 
artisans out of some vague nostalgia. Testing Problems is one such 
volume: a fascinating collection of psychometric bedtime reading. 

There can be little doubt that the Invitational Conference on 
Testing Problems has been, since its inception in 1936 by the Ameri- 
can Council on Education (ACE), a high point on the psychometric 
calendar. Whether the expansion in size, coverage of topics, and in- 
creasing heterogeneity of participants’ interests that has occurred 
with the passage of time has been beneficial to, or has had detri- 
mental effects on, the payoff from the Conference are questions that 
need not be answered directly here. But they are data sources that 
are useful in evaluating the substance of the Conferences and, there- 
fore, the content of the book. 

Because of the widening breadth of interest among the persons at- 
tending the Conference the papers have tended to become less spe- 
cifically technically-oriented and have included other than content 
dealing with “testing problems” (despite titles implying a continu- 
ance of a specific focus). 

What this volume makes very clear is that the papers presented 
provide—with rare exception—no substitute for reading in the orig- 
inal the works of the many distinguished authors. That the papers 
did, when they were read at their respective conferences, make much 
of the audiences aware of new developments and new thinking on 
many aspects of evaluation, there is probably no doubt. Yet it 
would be interesting to “discover” somehow just what impact these 
papers do have, and have had. When one looks at a “random” se- 
lection of papers—the volume opens successively at Saunders on 
moderator variables; at Tucker on multimode factor analysis; at 
Bayley on the curve of intelligence—one’s appetite is certainly not 
assuaged. But it must be granted that maybe it was not intended to 
be; maybe the purpose was to make some people aware of the ex- 
istence of certain techniques, or to make them rethink issues.

By now, however, much of the innovation that these speakers 
and their papers encouraged, has become part of psychometric heri- 
tage. Of course, the topics around which the papers centered will 
always be raised in debate—reliability, validity, ethical questions 
regarding the use of test data, and test-taking behaviors—and the 
historical trends discernable in the resolution of their surrounding 
issues are of interest. But the trends are really only of historical in- 
terest, for the most part. And even the undoubted historical value of 
this volume has of necessity been somewhat weakened by the un- 
availability of papers presented before 1946, since the annual Pro-
ceedings were not published prior to that date. 

Here then, is a collection of highly interesting documents illus- 
trating trends and cross-currents in the history of test usage and de- 
velopment. Yet one can hardly recommend them, except to the 
connoisseur of collector’s pieces. 

With the material with which she was confronted, the editor and 
her advisers have done a remarkable job. Not that the material was 
poor copy—far from it. But the very generality of the content; the 
obsolescence of some of the issues; and the need to provide for a 
heterogeneous-interest consumer must have made the task a difficult
one. 

The contents are sectioned into three major parts: Test Devel- 
opment and Use; Psychometric Theory and Method; and Special 
Problems in the Assessment of Individual Differences. 

Within Part I (Test Development and Use) are four chapters. 
The first, “Goals and Functions of Testing” contains papers by 
Ebel, DuBois, Gustad, Tyler and Cronbach, ranging from a too 
brief (scarcely sociophilosophical, at any rate) look at the social 
consequences of testing, to a light description of the place of decision 
theory in testing. Most of these papers stand adequately on their 
own—but they would certainly need deeper, supplemental reading 
to be of much utility. The second chapter, “The Educational Con- 
text,” is one of the more historically-oriented, “this-is-what-we- 
were-doing-then” chapters. It contains papers by Tyler on the 
growth of concern for curriculum innovation and development (an 
interesting contrast, scarcely ten years old); by Holland on teaching 
machines; by Pressey on school acceleration practices; and by Stuit 
on measuring the quality of a college. It is hard to imagine an audi- 
ence for whom all of these would be vitally interesting—and yet 
they are, each and all, highly intriguing in the light of subsequent
events. It says much for the authors that their views can still be read
with this kind of pleasure. The third chapter of the section, “Im- 
proving Construction and Use of Educational Tests,” is concerned 
for the most part with exercise writing (papers by Diederich, Ne- 
delsky, and Englehart), and includes a paper by Ebel on college 
examination services and one by Lennon on the test manual. Na- 
turally enough, the specificity of these papers makes them not all 
things to all men, and the paper by Ebel seems to be one of the
few dubious choices by the editor. The final chapter in the
part is also of more specific interest. It contains papers by Kelly and 
by Hubbard on medical education; by Frederiksen on his In- 
basket Tests, and by Ryans on predicting teacher effectiveness. To 
each, his own, and not having especial interest in testing in the
professions does not detract from the contributions.

Part II, on Psychometric Theory and Method, also contains four
chapters. Interestingly enough, the “vintage” of these papers is
somewhat older than that of the other sections: no causal inferences 
shall be drawn. The first chapter of this part is titled “Norms, Units 
and Scales,” though the papers (by Lennon, Angoff, Cronbach, Lind- 
quist and Gardner) do not stray far from the norms question. 
Again, the issues raised are vital, but the discussion leaves the 
reader more frustrated than satisfied. Some questions are better 
answered when discussed in full-blown technical glory. The following
chapter, “Reliability, Validity and Homogeneity,” which contains
contributions by Thorndike, Gulliksen, Carroll, Saunders, and
Anastasi, is somewhat more appealing, perhaps because at least 
some of the questions raised can be treated verbally. The third
chapter of the section evinces some discomfort—overtly declared, 
in one instance—on the part of the authors in trying to cope with 
essentially highly quantitative concepts in verbal fashion. The
title of the chapter is “Multivariate Procedures in Test Construc- 
tion,” and one can but sympathize with the contributors—French, 
Tiedeman, Eysenck, Kaiser, and Tucker. That their papers are so 
readable is a tribute in itself. And finally in this second section
comes a chapter on “Nature and Development of Intellectual 
Traits.” Bayley makes a general contribution in her paper on the 
curve of intelligence; Davis, Turnbull and Carroll have papers on 
the Verbal Factor; Thurstone, Guilford and Thorndike, on creativity. 
Somehow, one is left with a vague “so what?” kind of feeling, in 
terms of substance. Yet the historical trends have been cleverly 
laid bare. 

Finally come three chapters on “Special Problems.” Chapter 10 
of the book is on cultural differences in test performance. In papers 
ranging from two 1949 to two 1964 contributions, Anastasi, Turn- 
bull, Lorge, Rulon, Fifer, and Wolf have hit upon some troublesome 
issues. A better integrated collection by Hanfmann, MacKinnon, 
McClelland, Crutchfield, Loevinger, and Messick on personality 
testing illuminates the authors’ interests and predilections in the 
field. And the volume finishes its presentation of papers with one of 
the better chapters—Tucker, French, Lorge, Sanford, McArthur 
and Zubin on judgmental assessment procedures. 

The book is rounded out with a complete list of conference papers,
and a subject and an author index.

If one believes the dust-jacket claims that this collection “has 
been carefully organized to be of significant use to the student con- 
cerned with psychological testing, test construction, multivariate 
analysis, and the educational use of tests, and to the professional 
worker in tests and measurement who wants to keep abreast of de- 
velopments in the field,” one is schizophrenically both doubtful and 
in agreement. Probably, the book does satisfy some of the wants of 
both groups. It gives a simple outline of several techniques that in
their original form would be beyond the reach of some of the potential
readers. Yet at the same time there is not quite enough substance
to save the specialist from having to turn to more technical
or substantive sources. And at ten dollars for the volume, this is a 
fairly expensive dalliance! 

However, the book will no doubt find considerable support as an 
historic document. Some people will be admirably served by its 
survey of problems that have plagued us over the years. And we 
shall all like to have such a volume on our bookshelves. After all, is 
there not something of the snob in us all? 

PETER A. TAYLOR
Rutgers University 


An Assessment of Quality in Graduate Education by Allan M. 
Cartter. Washington, D. C.: American Council on Education, 
1966. Pp. xvi + 132. $5.00. 

In 1963, the American Council on Education (ACE) established 
a Commission on Plans and Objectives for Higher Education whose 
function was to be the continuing study of long-range problems 
facing colleges and universities, and to make policy recommenda- 
tions where such were possible and appropriate. One of the early 
questions to which the Commission addressed itself was: “What are 
the present strengths and weaknesses of our graduate schools in providing
well-trained scholars for both teaching and research?” Cartter’s
monograph is a report on the attempts to answer this question.

The ACE’s interest in the problem of assessment of the quality of 
graduate education was probably stimulated by a desire to vali- 
date (or invalidate) the academic grapevine, which provides the 
public an impressionistic evaluation of the nation’s institutions of 
higher learning. The known diversity of the American system carries
with it the danger that reputations may become fixed. As Riesman
so aptly says (Constraint and Variety in American Education,
1958, p. 5):

“... colleges that have developed a novel or more demanding
program cannot get the students to match it, while other institutions
that have decayed cannot keep away students who should
no longer go there.”

The present report is one of a projected series of investigations at
five-yearly intervals providing the public with information about
changes in the quality of some university offerings—the kinds of
change that might be overlooked by prospective consumers. Obviously,
university faculties and courses are not static. The ACE assumption
is that only by conducting continuous evaluation will it be
able to keep the public properly aware of relative quality-shift.

“Quality” is a difficult attribute to define and to measure. There is
no single, universally agreed-upon index (or indeed, combination of
indices) of it. Operationally, quality is a value judgment. It is assessed
subjectively. Like their predecessors, the results of the present
study (necessarily) rest ultimately on opinion. But the author
and his Commission have attempted to minimize the effects of
subjectivity by careful selection of “informed” respondents and by
specification of the ratings requested from them. The range and 
number of institutions selected for study were increased to be more 
nationally representative than in previous studies. The Cartter re- 
port, “using chairmen, senior scholars, and junior scholars in 106 
universities, is much less vulnerable to the charge of institutional 
and regional concentration.” 

One important consideration in opinion survey techniques is the 
qualifications of the respondents. Another is the assumption that the 
greater the consensus, the more probable it is that opinions repre- 
sent fact. For most of the university offerings that were studied, at 
least one hundred individual ratings were obtained. When these 
ratings were grouped according to the age and/or rank of the re- 
spondent, a high degree of consensus (Spearman rank correlations 
greater than .70) was obtained. There was less agreement between 
regional groupings of respondents. This was considered due to differential
knowledge of local institutions. Other methods were used
by the Commission to assess the reliability of their results. For example,
they used an independent study of the ranking of Political
Science Departments. They correlated the rankings of faculty and 
of effectiveness of the program with objective indices such as publi- 
cation records of scholars, and faculty salaries. In each instance, the 
relationship was high. An interesting internal check was provided 
by a clerical error—Harvard, Cornell and Stanford, which do not
have formal doctoral programs in geography, were mistakenly left
on the list of departments in that field and each was ranked as being 
of “insufficient quality to grant the doctorate.” 
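
As a minimal sketch of the kind of consensus check described above, the Spearman rank correlation between the mean ratings given by two respondent groups might be computed as follows. The ratings and the group names `senior` and `junior` are invented for illustration and are not taken from the Cartter report; the coefficient is obtained here as the product-moment correlation of the ranks.

```python
def ranks(values):
    """Return 1-based ranks, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def pearson(x, y):
    """Product-moment correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# Hypothetical mean ratings of six departments by two respondent groups.
senior = [4.8, 4.5, 4.1, 3.9, 3.2, 2.7]
junior = [4.6, 4.7, 3.8, 4.0, 3.0, 2.9]
rho = spearman(senior, junior)
print(f"Spearman rho = {rho:.3f}")
```

With these invented figures the two groups disagree only on adjacent pairs of departments, so the coefficient comes out well above the .70 level mentioned in the report.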

Thirty academic disciplines were included in the study. Selection 
was based on the criteria of the subject’s representativeness of the 
major areas in Arts and Science, and on the provision of a high 
degree of overlap with earlier studies so that comparisons were 
possible. Psychology was grouped with the “Biological Sciences”; 
Education was omitted as a separate discipline. Conceding the 
difficulty of defining Education as a field of study, its omission is 
interesting in view of the fact that between 1953 and 1962, Edu- 
cation was second only to the Physical Sciences in the number of 
doctorates awarded. Nevertheless, the academic disciplines in- 
cluded in the present study account for about 75 per cent of all 
Ph.D.’s awarded during the decade preceding the study. 

About 200 universities grant the doctoral degree in the United 
States today, but 95 per cent of all earned doctorates come from the
largest 100 institutions. The ACE sample comprised the 100 universities
forming the Council of Graduate Schools as of 1961, plus
six institutions that had granted over 100 doctorates in the period
1952-63.

As for the respondents, the sample was quite heterogeneous, being 
consciously selected to represent the “informed opinion” of depart- 
ment chairmen; “distinguished senior scholars”; and “knowledge- 
able junior scholars.” The sample was weighted by the size of the 
department as measured by the number of doctorates awarded in 
the time period considered. The deans of the 106 graduate schools 
were asked to provide lists of potential participants, and a 100 per 
cent reply provided names of 5,400 people, 4,250 of whom responded 
to the questionnaire. A total of 4,008 usable replies was obtained.

The questionnaire asked two basic questions: to rate the quality 
of the graduate faculties on a six-point scale, and to assess the ef- 
fectiveness of the doctoral program on a four-point scale. Each judge 
was asked to rate only his own field. An approximately equal percentage
of returns was received from each of the three age-and-rank
groups.

The results of the surveys are by now well-known. Any academician
would find much fascinating reading, and a provocative source
for many an argument, in the tables. It is a rare category when 
Harvard and the University of California at Berkeley do not appear 
at, or near, the top of the list. Alumni of many institutions will, 
however, be rewarded in their search of the top twenty names in 
each category, for although the same few names tend to occupy the 
highest positions, there is considerable variability as one reads 
down the list. 

Results of opinion surveys will always be equivocal. The value of 
a study such as this may also be questioned. But the reasons given 
by ACE for doing this assessment would seem to justify its doing. 
And further, the intent to repeat the survey every five years conveys
an added value. They, and we, expect that the information-gathering
techniques will improve with each successive survey,
yielding increasingly objective results.

The fact that faculties are asked to make ratings of departments
in their field on a regular basis could be anticipated to lead to a
closer scrutiny of their own offerings, with presumed subsequent
and accelerated improvement.

Problems associated with halo effects, response bias, and time lag
in publication of results will probably continue to plague future
reports. Some declaration of the representativeness of specialties
within a field would help in assessing the validity of the survey.

Although the precise relevance of the report to Psychology, and
particularly to Education, is at best partial, the report makes
interesting—perhaps even vital—reading. As an illustration of the
conduct of an opinion survey, the report is near exemplary, even
acknowledging that it will not be all things to all men.

PETER A. TAYLOR
SUSAN F. FORD
Rutgers University


Foundations of the Theory of Prediction by William W. Rozeboom. 
Homewood, Ill.: Dorsey, 1966. Pp. viii + 628. $10.50.

The present volume is one of a series of scholarly, technical books 
on psychology published by The Dorsey Press over the past few 
years. Having perused several of the books in this series, the re- 
viewer is of the opinion that their general format, and particularly 
the typeset, will not be greatly appealing to the typical student, and 
the fledgling undergraduate in particular. 

Foundations of the Theory of Prediction offers no exception to 
this general impression. It can be predicted with 95 per cent con- 
fidence that a simple thumbing-through of its pages will frighten 
off many of the advanced undergraduates and graduates for whom 
it is reportedly intended. For example, the number of formulae and 
symbols is absolutely staggering! The author’s cavalier assurance 
that “... the amount of mathematical skill required to master the 
material is quite nominal—all that is needed is a little elementary 
(high school level) algebra and some receptivity to mathematical 
thinking” will offer cold comfort to mathematical unsophisticates 
who venture within the cover of this book. 

According to the 1966 APA Directory, Dr. William W. Rozeboom 
received his Ph.D. from the University of Chicago (1956) and then
traveled to Minneapolis for a two-year NSF postdoctoral at the 
University of Minnesota. Subsequently he taught at St. Olaf Col- 
lege, Wesleyan University, and the University of Alberta, where 
he is presently employed. Dr. Rozeboom’s areas of specialization 
are behavior theory, formal methodology, and the philosophy of 
science.

The author's interest in philosophy may afford some insight into 
the raison d’être of his writing style. Many of the sentences in
Foundations . . . are absolutely Proustian in length, and the reason- 
ing, although profound and lucid in many places, tends to be exces- 
sively pedantic and unreasonably difficult. Perhaps it would have 
been better to have spent the first part of the book on the fundamen- 
tals of calculus and matrix theory rather than relying so much on 
complicated algebraic and intuitive arguments. Like the modal 
philosopher, the author likes words and knows how to use them. 
But unlike the philosopher, he has a sense of humor. Unfortunately, 
his sense of humor, rather than offering a respite to the dutiful tra- 
vail of the reader, frequently gets in the way of the reader’s attempt 
to understand the point being made.

Enough of a critical nature! The reviewer concluded the book
with a feeling that his labors had been worthwhile, and perhaps that 
is a sufficient tribute to any author. In any event, the book consists 
of 10 chapters, appendices on the normal distribution and matrix 
derivations, references, glossary, name index, and subject index. The 
first four chapters comprising Part I cover the basic statistics of 
prediction, emphasizing the linear regression model. Chapter 4, “Sta- 
tistical Regression and the Theory of Prediction,” is a lengthy (126 
pages) discussion of correlation and regression containing a more 
thorough treatment of regression models and error theory than 
the reviewer has yet encountered in a book of this sort. Part II,
“Test Theory: The Assessment of Predictors,” deals with validity, 
factor analysis, reliability, a unique and useful chapter on the vari- 
ance structure of composite tests, and an epilog chapter which con- 
tains information on sampling, probability, nonlinearity, and other 
special problems in test theory. 

In summary, in the reviewer’s opinion this book will appeal more 
to the teacher of testing courses and certain gifted graduate stu- 
dents than to the typical student who enrolls in a course on test 
theory. In addition, it should prove to be a valuable reference work 
and a provocative source book for such a course or for anyone in- 
terested in the complexities of the theory and statistics of prediction. 

LEWIS R. AIKEN, JR.
Guilford College 


Psychological Statistics: An Introduction by Frederick A. Courts. 
Homewood, Illinois: The Dorsey Press, 1966. Pp. x + 349. $7.50. 

Psychological Statistics: An Introduction could be described as a
junior version of Quinn McNemar’s Psychological Statistics in the
sense that Courts’ text has the quality of McNemar’s exposition
and some similarity in choice of topics. Yet it is not simply a
scaled-down version of McNemar’s popular intermediate text. The choice
of topics and emphasis will appeal to many who teach the first 
course in statistics in the traditional format. 

The introductory chapters briefly treat scales of measurement, 
discrete and continuous variables, intervals and significant digits 
(11 pages). Frequency distributions and graphs are treated con- 
cisely in 19 pages in Chapter 3. The presentation is adequate and 
avoids the discursiveness of many introductory texts. Chapter 4 (19 
pages) covers measures of central tendency and includes a section 
on the single summation operator. Also included in this chapter is a
five-page section on computing the mean from an arbitrary origin.
Although one may argue that this topic unnecessarily confuses the
student and is no longer needed with the availability of desk calculators,
these are not universally available, and some instructors may
feel the topic provides insight into data handling techniques; in
other situations the section may be easily omitted. A minor criticism
can be directed towards the partial failure to relate the various 
measures of central tendency to the relevant scales of measure- 
ment. 
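
For readers unfamiliar with the hand-computation device at issue, the following is a minimal sketch of computing a mean from an arbitrary origin. The scores and the chosen origin are invented for illustration, and the function name `mean_from_origin` is hypothetical; it is not claimed to reproduce Courts' own worked example. The idea is that the mean equals the guessed origin plus the mean of the (small) deviations from that origin.

```python
def mean_from_origin(scores, origin):
    """Mean computed as origin + (sum of deviations from origin) / N."""
    deviations = [x - origin for x in scores]
    return origin + sum(deviations) / len(deviations)


# Hypothetical scores; 50 is a convenient guessed origin near the center.
scores = [52, 47, 55, 49, 50, 53]
shortcut = mean_from_origin(scores, 50)
direct = sum(scores) / len(scores)
print(shortcut, direct)
```

Any origin gives the same result; a value near the center of the distribution simply keeps the deviations small enough to sum by hand.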

Chapter 5 (26 pages) treats measures of dispersion. Astound- 
ingly the groundwork for the standard deviation is established with 
Tchebycheff’s Theorem in an appealing presentation which is ap- 
preciated, unfortunately, only by those students with a moderate 
degree of mathematical sophistication. Percentiles are treated adequately
for student understanding but not for computational facility.
Range, interquartile range, and semi-interquartile range are disposed
of briefly. Again, seven pages are devoted to computing schemes involving
arbitrary origins, units, and problems of grouping.
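
The role of Tchebycheff's Theorem as groundwork for the standard deviation can be illustrated with a short sketch. The theorem guarantees that, for any distribution, at least 1 - 1/k² of the observations lie within k standard deviations of the mean. The scores below are invented, and the function names are hypothetical; the population form of the standard deviation (division by N) is assumed.

```python
from statistics import mean, pstdev


def proportion_within(scores, k):
    """Proportion of scores lying strictly within k standard deviations of the mean."""
    m = mean(scores)
    sd = pstdev(scores)  # descriptive (population) standard deviation
    inside = sum(1 for x in scores if abs(x - m) < k * sd)
    return inside / len(scores)


def chebyshev_bound(k):
    """Lower bound on that proportion guaranteed by Tchebycheff's Theorem."""
    return 1 - 1 / k ** 2


# Hypothetical, deliberately skewed scores with one extreme value.
scores = [1, 2, 2, 3, 3, 3, 4, 4, 5, 12]
for k in (1.5, 2, 3):
    print(k, proportion_within(scores, k), round(chebyshev_bound(k), 4))
```

Even for this lopsided set of scores the observed proportions stay at or above the guaranteed minimum, which is the sense in which the standard deviation carries information about spread without any assumption of normality.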

A thorough treatment of linear regression (Chapter 6, 19 pages) 
precedes the introduction to correlation. The regression chapter is 
difficult for those students who have not been exposed to the funda- 
mentals of analytic geometry and may seem excessively long and 
detailed for a semester or one-quarter course. The product-moment 
coefficient receives extensive treatment in Chapter 7 (29 pages). The 
interpretation of r is covered in 10 pages reminiscent of McNemar’s 
approach. Included in this chapter is a three-page illustration of the
scatter-plot method of computation involving arbitrary origin and
unit. Again, the comments which were made above with respect to
the computation of the mean with an arbitrary origin can be ap- 
plied. However, there is justification of the scatter-plot for visual 
inspection of the data beyond the value for hand computation. 

In Chapter 8 (26 pages) the concepts of probability, permuta- 
tions and combinations, and the binomial distribution are reviewed 
in a rather breathless fashion which probably has minimal benefit 
for the student with little or no prior exposure to these topics. The 
remainder of the chapter (11 pages) contains a good discussion of 
the normal curve and the use of the normal tables. There is a brief 
introduction to the normal approximation to the binomial. 
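
A minimal sketch of the normal approximation to the binomial may help fix the idea. The values of n, p, and k are chosen arbitrarily for illustration, the function names are hypothetical, and the continuity correction of one-half is an assumption of this sketch rather than a detail confirmed from Courts' text.

```python
from math import comb, sqrt
from statistics import NormalDist


def binomial_cdf(k, n, p):
    """Exact P(X <= k) for X distributed as Binomial(n, p)."""
    return sum(comb(n, i) * p ** i * (1 - p) ** (n - i) for i in range(k + 1))


def normal_approx_cdf(k, n, p):
    """Normal approximation to P(X <= k), using a continuity correction of 0.5."""
    mu = n * p
    sigma = sqrt(n * p * (1 - p))
    return NormalDist(mu, sigma).cdf(k + 0.5)


# Hypothetical case: 12 or fewer heads in 20 tosses of a fair coin.
n, p, k = 20, 0.5, 12
exact = binomial_cdf(k, n, p)
approx = normal_approx_cdf(k, n, p)
print(f"exact = {exact:.4f}, normal approximation = {approx:.4f}")
```

For a symmetric case such as this one the two values agree closely; the agreement would be expected to worsen as p moves away from one-half or as n shrinks.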

The basic ideas of statistical inference are discussed in Chapter 9 
(39 pages). Sampling theory is presented in an exciting demonstra- 
tion method which is marred only slightly by an uncomfortable 
notation and one uncharacteristically muddy paragraph. The comparison
and meaning of the descriptive index of the standard deviation
($S = \sqrt{\sum x^2 / N}$) and the relatively unbiased population estimator
($s = \sqrt{\sum x^2 / (N - 1)}$) is beautifully handled in about four pages—a
discussion which virtually eliminates the typical confusion displayed
by students about this topic and plants the seeds for a later understanding
of the concept of biasedness. The chapter concludes with
a variety of topics including confidence intervals, an introduction
to the t-distribution, and a nondevelopmental presentation of
standard errors for the basic statistics and others which are not
commonly encountered in an introductory text. 
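
The distinction between the descriptive standard deviation and the population estimator can be shown in a few lines. This is only a sketch under stated assumptions: x is taken to denote a deviation from the sample mean, the scores are invented, and the function names are hypothetical.

```python
from math import sqrt


def sum_squared_deviations(scores):
    """Sum of squared deviations of the scores from their own mean."""
    m = sum(scores) / len(scores)
    return sum((x - m) ** 2 for x in scores)


def descriptive_sd(scores):
    """S: square root of the sum of squared deviations divided by N."""
    return sqrt(sum_squared_deviations(scores) / len(scores))


def estimated_sd(scores):
    """s: square root of the sum of squared deviations divided by N - 1."""
    return sqrt(sum_squared_deviations(scores) / (len(scores) - 1))


# Hypothetical sample of eight scores.
scores = [2, 4, 4, 4, 5, 5, 7, 9]
S = descriptive_sd(scores)
s = estimated_sd(scores)
print(f"S = {S:.4f}, s = {s:.4f}")
```

The estimator s always exceeds S, by the factor sqrt(N / (N - 1)), and the two converge as the sample grows, which is one way of seeing why the distinction matters most for small samples.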

The testing of hypotheses by single samples is covered in Chapter 
10 (19 pages). Concepts in formulating hypotheses, selecting a
statistical test, recognizing decision errors (Type I, Type II errors), 
carrying out one-tailed versus two-tailed tests, and setting up the 
critical ratio are discussed in an interesting and clear manner. 

Group comparisons (Chapter 11, 24 pages) begins with the t- 
ratio for correlated groups which is generalized convincingly to the 
t-test for random groups. The treatment is excellent but would best 
be supplemented with additional examples from the instructor. The 
sign-test is also presented. The two short sections on the standard 
error of a difference between two measures and testing differences of 
dispersion with the normal curve table seem somewhat superfluous. 
The last section of the chapter introduces the F-distribution to test 
variance ratios and to compute confidence intervals of $\sigma^2$.

Chapter 12 (24 pages), entitled frequency comparisons, introduces
chi-square and testing of the differences between proportions.
The section on chi-square is quite complete and includes a comparison
of chi-square, the sign-test, and McNemar’s test when applied
to the same data. There seems to be little justification for the
section applying $\chi^2$ as a test of normality when in Courts’ own
statement the Kolmogorov-Smirnov test is superior. Some other
example of the application of chi-square as a test of goodness-of-fit
would be more appropriate. The median test and the computation 
of exact probabilities are also briefly surveyed. 

Analysis-of-variance is introduced in Chapter 13 (31 pages). The 
section on basic concepts of the partitioning of the sums of squares is 
good. The testing of the differences of means between two groups is 
smoothly generalized to several groups in a manner that relates t to 
F. Since no material is presented for the double summation operator, 
many students will require some supplementary material at this
point. Next, the analysis-of-variance is applied to testing of differences
between correlated means and to the principles of
factorial designs.
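
The relation between t and F for the two-group case can be checked numerically: with two independent groups, the square of the pooled-variance t equals the one-way analysis-of-variance F. The sketch below uses invented scores and hypothetical function names, and is not taken from Courts' text.

```python
from statistics import mean


def pooled_t(a, b):
    """t-ratio for two independent groups using the pooled variance estimate."""
    na, nb = len(a), len(b)
    ma, mb = mean(a), mean(b)
    ss_within = sum((x - ma) ** 2 for x in a) + sum((x - mb) ** 2 for x in b)
    pooled_variance = ss_within / (na + nb - 2)
    return (ma - mb) / (pooled_variance * (1 / na + 1 / nb)) ** 0.5


def one_way_f(groups):
    """F-ratio from partitioning the total sum of squares into between and within."""
    scores = [x for g in groups for x in g]
    grand_mean = mean(scores)
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    df_between = len(groups) - 1
    df_within = len(scores) - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)


# Hypothetical scores for two independent groups.
group_a = [5, 7, 8, 6, 9]
group_b = [3, 4, 6, 5, 2]
t = pooled_t(group_a, group_b)
F = one_way_f([group_a, group_b])
print(f"t = {t:.4f}, t squared = {t ** 2:.4f}, F = {F:.4f}")
```

The same `one_way_f` function accepts three or more groups unchanged, which is the sense in which the two-group test generalizes to several groups.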

Chapter 14 (8 pages) continues the discussion of the analysis-of-variance
for the computation of the correlation ratio, $\eta$. This is a
worthwhile topic for longer introductory courses to counter “linear 
thinking” that is produced in the beginning student by virtue of the 
usual emphasis on the product-moment coefficient. 
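
A small sketch shows why the correlation ratio counters "linear thinking." It assumes the usual definition of the squared correlation ratio as the between-groups sum of squares divided by the total sum of squares; the data are invented to follow an inverted-U pattern, and the function names are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Product-moment correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def correlation_ratio(groups):
    """eta = sqrt(SS between / SS total), with y scores grouped by level of x."""
    scores = [y for g in groups.values() for y in g]
    grand_mean = sum(scores) / len(scores)
    ss_total = sum((y - grand_mean) ** 2 for y in scores)
    ss_between = sum(
        len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups.values()
    )
    return sqrt(ss_between / ss_total)


# Hypothetical inverted-U data: y is high only at the middle level of x.
groups = {1: [2, 3, 4], 2: [7, 8, 9], 3: [2, 3, 4]}
x = [level for level, g in groups.items() for _ in g]
y = [score for g in groups.values() for score in g]

r = pearson(x, y)
eta = correlation_ratio(groups)
print(f"r = {r:.4f}, eta = {eta:.4f}")
```

Here the product-moment coefficient is essentially zero while the correlation ratio is high, so a student relying on r alone would miss a strong curvilinear relationship.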

The final chapter (15 pages) quickly scans other correlational 
techniques which are common in the literature. A quick overview of 
this chapter will be valuable for the student in his later readings and 
particularly appropriate for those who have interests in psychological
testing.

The book contains an adequate selection of tables.

The negative criticism of the first printing of this text must begin
with the number of typographical errors. Although none of the dis- 
covered misprints is extremely serious (with the exception of the in- 
terchanged headings of the F-tables for the 2.5 and the 5 percent 
points), and although in most cases misprints are easily detected 
(cf. the number of heads in the table on p. 131, and the position of
the radicals in the second line of p. 149), misprints are nonetheless
annoying, if unavoidable, in today’s pressure publishing. 

In preparing a course syllabus, the instructor will encounter two 
difficulties. First, the sections of the text are not numbered for easy 
reference, and some long sections (for example on $\chi^2$) are not clearly
separated such that parts may be conveniently assigned independ- 
ently. Second, although the author maintains that statistical in- 
ference may be approached earlier in the course by postponing 
regression and correlation, in practice a reordering of the text in this 
manner is difficult because of a greater dependence of the inferen- 
tial material on the postponed chapters than is immediately ap- 
parent. This difficulty is increased by the absence of section 
numbers. 

Many of the exercises provided in the text are excellent and illustrative.
A number of the exercises are too long for the class without
calculator facilities. Instructors who face this situation, and feel
that some computational facility is a desirable goal of an introduc- 
tory course, would do well to provide short exercises in place of these 
longer problems. A brief exercise text could be a useful supplement
to Courts’ text provided that the thought-provoking questions sup- 
plied by Courts were also assigned as well as some exercises that re- 
quire problem lay-out. 

The instructor’s manual, which is extremely brief, suffers from 
typographical as well as content errors in several places. Many of 
the examination questions supplied by the author are more appropriate
for advanced rather than beginning students and are not suitable for
objective examinations for large classes. (No answers are provided 
for examination questions, since most are of the discussion type.) 
The novice instructor will find the manual of use only as a dubious 
check of his own solutions of assigned problems. 

Although Courts states that the text can be mastered with a 
knowledge of algebra only, one finds that those students who have 
avoided higher level mathematics courses usually have a reason for 
doing so which somehow affects their performance in an introduc- 
tory statistics course. 

Obviously the text is much more than a basic minimum; perhaps 
it could be described as an intermediate level introductory course, 
midway between a cook-book and a first course in mathematical 
statistics. Proofs which are given, for example, are neither the most 
rigorous nor the most intuitive. The text has many positive features
which will make it a strong competitor in the rash of new introductory
statistics textbooks. The breadth, depth, and clarity of the
text, beyond the minor criticisms above, are amazing. The compe- 
tent student will find that the volume will be a useful reference for 
review as he approaches advanced courses. The notation is consistent
and where no standard form has yet been adopted, competing
notations are clearly related.

Although the text will probably not be selected by those instruc- 
tors who are offering a course jointly in several departments, it will 
be most appealing to those who direct their classes towards the prep- 
aration of professional psychologists. The instructor who finds 
some other text more appropriate for his classes will find in many of 
Courts’ discussions a presentation that could well be emulated in 
lecture; the text would also serve well as a source book for in- 
structors in introductory psychological statistics. 

Finally, the book is well bound, of good paper, and has a legible 
type face. 

THOMAS A. TYLER
University of Illinois 
Chicago Circle 


Handbook of Mathematical Psychology by R. Duncan Luce, 
Robert R. Bush, and Eugene Galanter, Editors. New York: 
John Wiley & Sons, 1963. Vol. I. Pp. xiii + 491. $10.50. 

The three volumes of the Handbook and two volumes of Read- 
ings in Mathematical Psychology cost $52.30. Since the five vol- 
umes are neither self-contained nor exhaustive, the editors assume 
that “their readers have or have easy access to” eight volumes of 
foundational mathematics and thirty-nine “basic” reference books. 
Clearly, graduate education has become monetarily so demanding 
that both wife and “kids” will have to earn while “daddy” learns on 
the royalty road to a Ph.D. in psychomathonomics. 

Readers acquainted with the literature of mathematical psychol- 
ogy must by now expect the amount of redundancy that exists 
within the Handbook, Readings, and basic references. Apparently, 
only a few mathematically tractable themes are currently popular, 
but they have as many variations as there are mathematically inclined
individuals in need of a dissertation, publication, or $(\mathrm{re})^n$publication
(with n a non-negative integer rarely equal to zero or
one). Besides redundancy, the contents reflect a number of the implicit
taboos of the mathonomic culture. Never, never must the
psychomathonomist conduct his own research or analyze his own 
data, unless of course the number of subjects is equal to one. Control
groups are out, since the only alternative to the theoretical
expectation is a good fit or some lesser degree of it. Experimental
events, if they exist, are not to be confused with “point-at-ables,”
for only the “reals” are for real. Instead, they are hypothetical concepts,
referred to as empirical relational systems, presumed to mediate
the numerical assignment responses of experimental psychomathonomists,
although in the final analysis the “type” of response
involved “is ultimately a theoretical decision.” And finally, besides 
the redundancy and taboos, the contents reflect a ritualistic pre- 
occupation with hagiolatry and folklore. 

In the northeast hyperplane of Psychomathonoma, within the 
confines of the Glb and Lub, rises Mount Olemmas, fabled abode of
the greater mathonomic gods Cebysev, Poincare, Markov, Laplace, 
and Bernoulli (whose functions are fundamental as opposed to 
Bogoljubov, Bunjakovskii, and others whose functions are primarily 
but wondrously ornamental) and lesser psychonomic gods Weber, 
Fechner, Stevens, and Thurstone. Whether greater god or demigod, 
each sits astride the summit sampling monogenized tagmemics, a 
friulian formulation from the files of Ajdukiewicz.

The folklore easily matches the hagiology in exoticism of nomenclature.
The Saint Petersburg Paradox, Gambler’s Fallacy, Prisoner’s
Dilemma, Lady or the Tiger, and Missionary’s Downfall
(possibly an apocryphon), being cases in point, are told, retold, and
tolled again to captive-sated audiences on all ceremonial and con- 
vivial occasions. 

As the psychomathonomic publications appear, then appear and
appear again, the discerning imprintee becomes increasingly aware 
that mathematical psychology, as in the case of statistics, analysis 
of variance, factor analysis, information theory, or psychometrics, 
consists of little more than a variety of routines for “examining the 
fine grain structure” of real, or not infrequently hypothetical, data. 
He discovers that mastering the contents of the Handbook, Read- 
ings, and basic references no more prepares him to develop a re- 
search problem of theoretical significance than does mastering the 
contents of current texts on analysis of variance. Having become 
proficient in either technique of data analysis, he can replicate the 
research cited in the texts and even use the mathematical models to 
motivate major methodological changes in the original experimental 
procedures. But if he is to introduce nonmethodological variations
into the original research design, he ordinarily must derive them 
from theoretical, intuitive, or empirical considerations extrinsic to 
the models (i.e., axiomatic statements) of mathematical psychology. 
For nearly all of the substantive questions considered in the context 
of mathematical psychology are borrowed directly from nonmathe- 
matical theories of behavior or represent nonmathematical inter- 
pretations of results obtained from experimental research conducted 
outside of the context of mathematical psychology. The kinds of 
questions that mathematical models ask of the data have a status 
identical to those posed by the models of analysis of variance. In
fact, if Hull and Fisher had been one and the same person and had
introduced analysis of variance as the theory of behavior under the 
assumption that H + D + K +V +... + error = E, behavioral 
science would have had its Newton and suffered a setback from 
which it might have required weeks to recover. Luckily, Fisher was 
not a mathematical psychologist or, more to the point, Hull did not 
invent analysis of variance. 

None of the above observations is intended to discredit the enor- 
mous contribution to methodology that mathematical psychologists 
have made during the period from 1950 to the present time. In
principle, furthermore, the symbiotic relationship existing between the
mathematical and the experimental or theoretical psychologist is in 
the highest interests of behavioral science. In actual fact, however, 
the demands upon an individual's time have led to overspecializa- 
tion and that in turn to considerable cross-cultural lag particularly 
at the points (the area of motivation provides a good example) 
where the mathematical models are presumed to be noncommittal.
The reviewers’ impression is that, theoretically and experimentally, 
psychological developments stopped in 1950 as far as some mathe- 
matical psychologists are concerned. In many instances this lag 
may reflect not only ignorance but also value judgments, a comment 
that applies both ways. 

Chapter 1. Suppes, P. and Zinnes, J. L. Basic Measurement
Theory: The authors offer a solution to a problem derived from the
original question of S. S. Stevens: namely, if a scientist represents his
experimental observations and results in the form of numbers, what 
mathematical operations can he perform on these numbers without 
unwittingly introducing misinformation into the conclusions? The 
problem is to invent a formal system and then restate and answer 
the question within the context of this system. 

The solution begins with a formal definition of three concepts,
scale, scale type, and meaningfulness. Paraphrasing portions of the
seventy-three page article very loosely, the argument proceeds as
follows: First, the scientist assigns formal properties to the
empirical operations used in a given measurement procedure. The chance
of satisfactorily completing this initial step depends to a large
degree upon the scientist’s intuition, experience, and mathematical
resources; the authors provide many examples of the finished
product. To illustrate, let a_1, a_2, ..., a_i, a_j, a_k, ..., a_n
denote different brands of cigarettes distinctly different in
mildness and a_i P a_j signify a smoker’s judgment that “brand a_i is
milder than brand a_j.” If the smoker judges “brand a_i is milder
than a_k” whenever he judges that “brand a_i is milder than a_j and
a_j is milder than a_k,” then a formal characteristic of P is that it
is a transitive relation. Upon further deliberation the scientist may
ascribe additional properties to P such as asymmetry (if a_i P a_j
then not a_j P a_i) and connectedness (if a_i ≠ a_j then either
a_i P a_j or a_j P a_i). Letting a ∈ A, the couple U = (A, P)
symbolizes an empirical relational system and summarizes the
developments up to this point.
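
The three properties ascribed to P in this paragraph (transitivity, asymmetry, connectedness) can be checked mechanically for any finite set of recorded judgments. The following is only a minimal sketch: the brand labels, the set of judgments, and the helper names are hypothetical and do not come from the Handbook. It also builds one candidate numerical assignment f and checks that “milder than” is mirrored by “less than.”

```python
from itertools import product

def is_transitive(P):
    """a P b and b P c must imply a P c."""
    return all((a, d) in P for (a, b), (c, d) in product(P, P) if b == c)

def is_asymmetric(P):
    """a P b must exclude b P a."""
    return all((b, a) not in P for (a, b) in P)

def is_connected(A, P):
    """Any two distinct objects must be ordered one way or the other."""
    return all((a, b) in P or (b, a) in P
               for a, b in product(A, A) if a != b)

# Hypothetical data: three brands and the judgments "x is milder than y".
A = {"a1", "a2", "a3"}
P = {("a1", "a2"), ("a2", "a3"), ("a1", "a3")}

# One candidate assignment (an assumption of this sketch):
# f(a) = 1 + number of brands judged milder than a.
f = {a: 1 + sum((b, a) in P for b in A) for a in A}

# The assignment represents P if a P b holds exactly when f(a) < f(b).
order_preserved = all(((a, b) in P) == (f[a] < f[b])
                      for a, b in product(A, A))
```

If any of the three checks failed, the rank-counting assignment used here would not be guaranteed to preserve the judged order, which is the sense in which the formal properties have to be established before numbers are assigned.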

Second, relying again upon training, experience, and intuition, 
the scientist associates the a_i, a_j, etc. with numbers (the positive
integers, for example) and the relation P with an "appropriately 
chosen" relation on numbers (the arithmetic relation “less than" 
or <, for example) until by trial and error he obtains a “numerical 
relational subsystem" that he can formally prove has properties 
parallel (isomorphic or homomorphic) to those of the empirical 
relational system. The couple (N, <), where N denotes the positive
integers, is a numerical relational subsystem isomorphic to
(A, P). This step is called representation.

Suppes and Zinnes write at times as if the empirical relational 
system were the objects and judgments themselves. Representation, 
then, would be an assignment of numbers to objects such that the 
similarities or differences among the numbers demonstrably re- 
flected the judged similarities or differences among the objects. 

The definition of a scale and the definition of a scale type rest
on a subtle distinction between a numerical relational system and
its subsystems. If R is the set of real numbers, N the set of
positive integers, < the arithmetic relation “less than,” then the
ordered pair ℜ = (R, <) is called a numerical relational system while
(N, <) is called a numerical relational subsystem because < is re-

morphism between (N, <) and (A, P) by the symbol f, the authors
define a scale as the triple (U, ℜ, f). A scale, however, is not a
scale type.

Third, in order to define the scale type (uniqueness), the
scientist finds a second subsystem of (R, <) isomorphic to (A, P),
namely (U, ℜ, g). The two functions f and g ordinarily assign
different sets of numbers (i.e., subsets of the reals) to the objects;
they are related however by a rule φ that transforms the numbers
assigned under f into the numbers assigned under g. In the “milder
than” example, φ is a monotone transformation, i.e., f(a_i) < f(a_j)
if and only if φ(f(a_i)) < φ(f(a_j)). Suppes and Zinnes define the
scale type in terms of the set {f, g, ..., h, ...} of numerical
assignments that map a given empirical system isomorphically onto
subsystems of the identical numerical relational system. If the
numerical assignments are related by a monotone transformation, as in
the above example, the scale type is called an ordinal scale.


a Er of no consequence (p. 15)." In fact, 
the selection of a numerical system for a given empirical system is
to some degree arbitrary; alternative numerical systems may be


by Stevens. There is no simple “yes” or “no” answer to the question
of whether a mathematical operation, such as addition, can be used
with a particular scale or not. Instead, the admissibility of an
operation depends on how the operation is used in a numerical
statement. For example, let x, y, and z be real numbers; then the
statement x + y > z is a meaningful use of addition if x, y, and z
lie on a ratio scale, while the statement x + y > 2 is not. The reason
is that an admissible similarity transformation of the ratio scale,
e.g., x to kx, y to ky, and z to kz, where k is a positive real
number, does not change the form (and therefore the truth value) of
the first statement since the k’s cancel, but does change that of the
second. In brief, “the meaningfulness of numerical statements is
determined solely by the uniqueness properties of their numerical
assignments, not by the nature of the operations in the empirical or
numerical systems (p. 73).” Two related points are that (1)
manipulation of the numbers can involve relations other than those in
the given numerical system and (2) although the domain of a numerical
system is the set of real numbers, the numerical system need not be
isomorphic to the real number system.
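
The invariance argument for the two statements discussed above can be tried numerically. This is a hedged sketch, not the authors' procedure: the sample values and the scale factors k are arbitrary choices, the function names are hypothetical, and the second statement's right-hand side is taken to be the constant 2 as printed. A finite check of this kind can refute invariance but cannot prove it; the proof for the first statement is the cancellation of k.

```python
def first_statement(x, y, z):
    # x + y > z: every term is rescaled by k, so k cancels.
    return x + y > z

def second_statement(x, y, z):
    # x + y > 2: the constant 2 is not rescaled (z is unused).
    return x + y > 2

def invariant_under_similarity(statement, values, ks):
    """True if the truth value is unchanged for every tested k > 0."""
    original = statement(*values)
    return all(statement(*(k * v for v in values)) == original for k in ks)

# Arbitrary illustrative measurements on a ratio scale.
values = (1.0, 0.5, 1.2)
ks = (0.5, 2.0, 10.0)

first_ok = invariant_under_similarity(first_statement, values, ks)
second_ok = invariant_under_similarity(second_statement, values, ks)
```

With these values the first statement stays true under every tested change of unit, while the second flips from false to true at k = 2. The truth of the second statement therefore depends on the unit of measurement, which is what “not meaningful” amounts to here.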

A reading of Chapter 1 is really incomplete unless followed by a
reading of Adams, et al. (1965). These authors bring Suppes and
Zinnes back to earth at the point of empirical (scientific)
significance. They suggest “an alternative way of empirical
meaningfulness which is . . . independent of . . . permissible
transformations. That is, a formula may be described as empirically
meaningful . .. in the sense of being definable in terms of the em- 
pirical opera on which measurement theory is 


tions and 
based ” They suggest ns "a matter of conjecture 
haie is in fact more fundamental 


They also ly the Ачуу. 
. 118)” also apply 
ee posers 2 s evaluating numerical statements (for- 
mulae) involving ordinary statistical operations. Their ri. 


analysis, which appears “. . . consistent with Stevens's , T 
eliminates much of the confusion that has been created in the course 


of the controversy over the question 
mathematical operations and the meaningful question of admissible 


numerical statements. 
may have been around for quite 
Although many of the ingredients Я 


: 1 was pu 
Chapter 1 is a landmark. It reduces a long-standing “puzzlement” to
what now appears obvious.
Chapter 2. Bush, R. R., Galanter, E., and Luce, R. D.
Characterization and Classification of Choice Experiments: After going over
this short chapter, the reviewers decided that only minimal comment 
is required, as it sets forth primarily definitional material. Emphasis 
is placed upon the abstract structure of the choice experiment in- 
volving the classical stimulus—response model. A helpful flow 
diagram of the choice experiment is furnished. In addition there 
are sections concerned with (a) ways for classification of the en- 
vironment, (b) instructions, pretraining, and identification func- 
tion, and (c) problems of definition and classification of choice 
experiments. A useful summary of the concepts of the choice experi- 
ment is presented in a table. 

Chapter 3. Luce, R. D. Detection and Recognition: This chapter 
contains Luce’s interpretation of threshold (neural quantum) the- 
ory, signal detectability theory, and his own choice theory. With 
regard to neural quantum theory, Luce proves that the argument 
motivating most of the theoretical research is specious. Proponents 
of the theory have long maintained that by implication the transi- 
tion from 0 to 1 of the probability of detection should be uniform 
not sigmoid as quanta units increase. However, Luce concludes that 
the “. . . theory as presently stated does not really make this pre- 
diction (p. 160).” Considering the august company that accepted 
the original prediction, not to mention the inviolability of the ac- 
creted dogma, Luce’s analysis should bring smiles in some quarters. 

In order to heal the wounds, Luce develops a valid quantum hy- 
pothesis “that the p = 1 and p = 0 intercepts stand in the ratio k: 
(k — 1), where k is an integer (p. 160).” The reference is to plots of 
percentage of increments detected to size of increments. By clever 
use of the mean-value theorem, he argues that even though this 
prediction depends on the physical measure of the stimulus used, 
the confounding effects should not be large enough to matter. Data 
obtained from a single subject and relayed from Bekesy to Stevens 
to Luce support the intercept ratio prediction. This fact, according 
to Luce, is “the main challenge . . . for those who do not believe that 
thresholds exist . . . (p. 164).” Other analyses are applied to the 
signal detectability and choice theories with routine results. 

Chapter 4. Luce, R. D. and Galanter, E. Discrimination: This 
fifty-two page chapter is devoted to the shopworn topics of Weber, 
Fechner, Thurstone, and the “jnd.” Unfortunately, it is published 
too late for the centennial of Fechner’s Elemente der Psychophysik 
(1860) and too early for the bicentennial. However, the gesture is 
what counts and judging from the contents, the reviewers believe 
that Fechner will be the last to complain about the treatment of the 
following question: for what (reasonably smooth) transformations 
of the physical scale is the transformed Weber function a constant 
independent of S (the stimulus magnitude in physical units) ? 
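
One first-order way to see what this question asks is sketched below. It assumes a differentiable transformation u of the physical scale and a small Weber function Δ(S); the symbols u, c, and k are introduced here for illustration only and are not the chapter's notation, and the chapter's own treatment may proceed differently.

```latex
% Transformed Weber function under u, to first order:
\[
\Delta_u(S) \;=\; u\bigl(S + \Delta(S)\bigr) - u(S) \;\approx\; u'(S)\,\Delta(S).
\]
% Requiring it to be a constant c, independent of S:
\[
u'(S) = \frac{c}{\Delta(S)}
\qquad\Longrightarrow\qquad
u(S) = c \int \frac{dS}{\Delta(S)} .
\]
% Special case of Weber's law, \Delta(S) = kS:
\[
u(S) = \frac{c}{k}\,\ln S + \text{const}.
\]
```

In the Weber's-law case the logarithm makes the transformed function constant exactly, not just to first order, since ln(S + kS) - ln S = ln(1 + k) for every S.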

The article is “theory rich and data poor (p. 304)” since “there
are relatively few data available that are suitable to test existing
mathematical theory (p. 240).” The prospects for increased experi- 
mental research in this area appear minimal in spite of the apparent 
mathematical prosperity; a century full of “dusters” is sufficient 
evidence for most psychologists to determine that the field of psy- 
chophysics, especially if explored in isolation from the fields of 
motivation, perception, cognition, and learning, is “rather barren.” 
Realizing the importance of psychological variables, Luce incorpo- 
rates a bias parameter into his models (pp. 147-154, 224-232, 256, 
258, 260-2, 271-3, 283-8) represented as a function defined over 
responses. But “the nature of this function—its dependence on 
things that we can manipulate experimentally—is not known 
(p. 306).” 

Chapter 5. Luce, R. D. and Galanter, E. Psychophysical Scaling: 
“. . . there is only one problem in all of psychology, . . . the
definition of the stimulus (p. 248).” This tentative suggestion by
Stevens introduces the third chapter in succession devoted to psy- 
chophysics and psychometrics. Discussed are similarity scales, bi- 
section scales, category scales, magnitude estimation scales, and 
psychological distance. Omitted are all of the important recent
developments in psychometric scaling (Shepard, 1966).

Perhaps the predominant characteristic of Chapters 3, 4, and 5 is 
the assurance and conviction with which Luce or Galanter construct 
axioms, assumptions, definitions, theorems, proofs, and other for- 
mal statements of the problems under discussion. The result is to 
create an impression best described by Boring (1929) in his classic 
comments on Herbart who “. . . exhibited the not uncommon case in
science in which inadequate data are treated with elaborate mathe- 
matics, the precision of which creates the illusion that the original 
data are as exact as the method of treatment (p. 249).” 

Many pages are devoted to the question of whether the numbers 
obtained by magnitude estimation (the subject assigns a number to
each stimulus presentation proportional to the subjective magni- 
tudes produced by the stimuli) form a ratio scale. Since Stevens 
has neither described his experimental procedures and the experi- 
mental events in terms of formal axioms (empirical relational sys- 
tems) and empirically validated the latter nor proved the existence 
of an isomorphic numerical representation and shown the group of 
transformations that characterizes all isomorphic representations 
into the same numerical system (as Suppes and Zinnes prescribed in 
Chapter 1), Luce and Galanter suggest that Stevens has not estab- 
lished the existence of a ratio scale of magnitude estimation. An 
analogy illustrates their argument. In determining whether Stevens 
is showing a profit or not, Luce and Galanter reject as criteria both 
intuition and appearances based on seat-of-the-pants style
bookkeeping. To obtain appropriate criteria, Luce and Galanter would
choose from among the available theories of cost accounting the
one they deem appropriate. If Stevens's records are inappropriate 
to an acceptable application of the theory, no determination can be 
made. Stevens must either change his methods of bookkeeping or 
develop the means for coordinating his accounts with the theory; 
only then can the profit or loss be determined. However, an aesthetic 
judgment rather than the fact that Stevens may go broke while the- 
oretically showing a profit seems to be the Luce-Galanter criterion 
of appropriateness in the original, and possibly ultimate, choice of 
the theory. 

Probably very few behavioral scientists would assign the impor- 
tance Luce and Galanter do to the question of whether a ratio scale 
can be recovered from magnitude estimation. Luce suggests that 
future theoretical and experimental developments in behavioral
science will lead psychophysicists through a preliminary psychometric
stage into the more interesting psychophysical stage of measurement,
the current measurement problems being due to the
psychophysicist’s attempt to bypass psychometrics. “There is 
precious little point . . . in trying to establish such (psychophysical) 
relations until the response-response (psychometric) theory has been 
rather carefully (developed and) tested (p. 147).” By the close of
the psychometric era there will be two separate theories, one in 
which the parameters measure the subject’s sensitivity to stimuli 
and another in which the parameters measure response biases 
(motivational factors) that are under the subject’s control. But 
exactly why behavioral scientists should then become dissatisfied 
with these psychometric achievements and discard them in favor 
of the stone age methods, unconvincing assumptions, and
experimenter-subject-specific findings of psychophysics, the authors do
not explain. 

Chapter 6. McGill, W. J. Stochastic Latency Mechanisms: In the
course of the development of theoretical science, definitions of time 
and probability have undergone repeated revision and proliferation. 
Not to be outdone in this department, psychomathonomists recently 
made their bid with a 50 page (Psychological Review 1963)
definition of personal probability, a hypothetical construct designed to
disrupt the “phase sequences” of the “objectivist” advocates of clas- 
sical hypothesis testing. So far, however, psychomaths have not 
tinkered with time, an astonishing oversight considering the cur- 
rent slippage between the response-time models and data. In the 
behavioral sciences, “time” is a conceptual frame of reference within 
which the various experimental events occur. The trouble with the 
use of time as a measure of behavior is that response latencies ap- 
parently reflect the confounded effects of several processes that have 
been neither identified nor placed under experimental control. 

Currently the theoretical analyses of response latency are based
upon assumptions and methods developed outside of the behavioral
sciences to solve non-behavioral problems. Consequently, most ap- 
plications of stochastic latency mechanisms, including those dis- 
cussed here, seem to be ad hoc, though not necessarily routine, ex- 
amples of curve fitting. One rational approach, for example, asks the 
reader to “imagine a neural ‘counter’ that accumulates an impulse 
count over time. ... When a weak stimulus is introduced, the count- 
ing rate increases. Consequently, if there are two counters, and one 
is accumulating only noise, a stimulus can be detected by observing 
the systematic divergence of the readings. .. . The time to accumu- 
late any fixed difference can be treated as a random walk along a 
semi-infinite line starting at zero and stopping with the first pas- 
sage through the criterion (p. 329).” Disregarding the question of 
whether this analogy and others similar to it are capable of neu- 
rophysiologically meaningful interpretation, the solutions one ob- 
tains to the random walk problem are contingent upon what sort of 
probabilistic process is postulated, and the choice of the process for 
latency data appears to be governed by the (a) availability and 
tractability of the mathematical models and (b) characteristics of 
the empirical distributions observed after the fact. Since the only 
independent variables postulated in these latency models are them- 
selves unobservables such as “extended sequence (s) of neural 
events, each of which is . . . a random delay with the same time
constant," “series mechanisms . . . not observable,” “impulses,” 
“neural integration mechanism,” “chain of k + 1 (neural) re- 
sponses," "examination process," "storage," "secondary process," 
and "inhibitory blocking reaction," the latency models do not have 
experimental events for antecedents; thus they cannot be tested.
True, poor fits are possible between models and data but only in 
the phenotypic sense in which the Pearson r may be judged to be 
an inappropriate statistic for some banana-shaped data plots. To 
infer the characteristics of a hypothetical process from the fact of a 
good fit is, in light of the hypothetical nature of the antecedents, to 
offer a “scientific” explanation at the level of traditional “instinct”
theory and subject to the same criticisms. In other words, “latency
mechanisms differing considerably in complexity may in fact
lead to the same frequency function. . . . This . . . implies an
uncertainty principle governing attempts to work backward from the
shape of a distribution of latencies to a latency mechanism
(p. 323).” Finally, the majority of the choice latency mechanisms
share a common trait; the stochastic processes are intractable and 
only deterministic approximations to the processes are readily
handled.
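
The two-counter analogy quoted above can be simulated directly once a probabilistic process is postulated, which is exactly the choice the reviewers say is made on grounds of convenience. The sketch below assumes, purely for illustration, that both counters are independent Poisson processes; the rates, the criterion, the seed, and the function name are hypothetical and are not taken from the chapter.

```python
import random

def first_passage_time(rate_noise, rate_signal, criterion, rng):
    """Time until (signal count - noise count) first reaches the criterion.

    Assumption of this sketch: each counter is a Poisson process, so the
    merged event stream has rate rate_noise + rate_signal and each event
    belongs to the signal counter with probability
    rate_signal / (rate_noise + rate_signal).
    """
    total_rate = rate_noise + rate_signal
    p_signal = rate_signal / total_rate
    elapsed, difference = 0.0, 0
    while difference < criterion:
        elapsed += rng.expovariate(total_rate)  # wait for the next impulse
        difference += 1 if rng.random() < p_signal else -1
    return elapsed

rng = random.Random(0)  # fixed seed so the run is repeatable
latencies = [first_passage_time(5.0, 10.0, 5, rng) for _ in range(2000)]
mean_latency = sum(latencies) / len(latencies)

# Under these assumptions the expected latency is
# criterion / (rate_signal - rate_noise) = 5 / (10 - 5) = 1.0,
# so mean_latency should fall close to 1.0.
```

Swapping the Poisson assumption for another process changes the latency distribution without any change in observable antecedents, which is the reviewers' objection in miniature.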
Readers interested in latency mechanisms but unfamiliar with the 
mathematical, statistical, or quasi-theoretical (i.e. psychological
theory) details and applications should read Chapter 14 via
Chapters 1-13 of the Bush and Mosteller Stochastic Models for Learning,
the one and only classic among the 39 basic references mentioned 
earlier. An interesting, highly readable, quasi-theoretical (i.e. psy- 
chological theory) introductory text to mathematical psychology is 
Atkinson, Bower, and Crothers An Introduction to Mathematical 
Learning Theory. Both books are published by Wiley. 

Chapter 7. Newell, A. and Simon, H. A. Computers in Psychology: 
This chapter describes computer programs used in the simulation of 
“cognitive processes” and other types of behavior. There is no 
a priori reason to expect less from this approach to the problem of 
predicting behavior than from the more conventional paper-and- 
pencil strategies employed by other psychomathonomists. To the 
degree that the proponents of either technique frequent the experi- 
mental laboratory and actually participate in minimally automated 
research on living organisms, their efforts may hold considerable 
promise. In any case, Chapter 7 contains an eloquent defense of 
computers as a major tool for psychological research. 

Chapter 8. Bush, R. R. Estimation and Evaluation: The subject 
of statistical inference covers hypothesis testing and parameter 
estimation according to Bush. Why do psychologists study the first 
but neglect the second? Because they pursue primitive research of 
an exploratory nature, that is why! “But a science must progress 
beyond this stage if it is to be a mature science. Herein lies the great 
hope of mathematical model building (p. 432).”

“Having estimated the parameters, . . . how well (do) . . . theory
and data agree? . . . We should not ‘test’ goodness of fit, we should
‘measure’ it (p. 432).” This is the problem of evaluation. Thinking
should dominate formal methods. A simple plot of one’s data is
revealing, a fact that is obvious to an engineer or physicist but is not
appreciated by psychologists. A fit may be really very good except 
for one data point, probably an error but possibly an important dis- 
covery. This anomalous datum should be carefully noted and ex- 
amined. Unfortunately, modern psychology has bred extremists. 
One school ignores numbers, the other ignores graphs. The student, 
however, should select the research tool most appropriate to the 
problem. 

Among psychologists the term “design of experiments” is
intimately associated with a special set of experimental techniques,
according to Bush. This use is curious because the psychologist 
knows he faces many nonstatistical design problems in planning 
good experiments. But he will use the analysis of variance whether 
it is sensible or not. Bush suggests that Fisher had a much broader 
conception than his followers and unlike them, was concerned with 
wasted scientific effort. 

“One seldom has an opportunity to test the axioms (of models)
directly, but when he has he should certainly do so (p. 188).”
However, Hull’s “postulates are really empirical generalizations, and
many psychologists have tried to test these so-called postulates by 
various kinds of experiments (Bush, 1960, p. 127).” 

As we take leave of Psychomathonoma, the many splendored 
homilies disappear into the subsets in the basis of Mt. Olemmas. 
While up on the saddle points, reflecting barriers reverberate with 
the sound of the psychomathic credo, “Models are more often
destroyed by better models than by experiments (p. 467).” But not
all of the noise from Olemmas is signal; some of it is noise. For 
something is brewing up there besides the nectar of Dionysus; the 
lesser psychonomes are rumbling. Long disenchanted with their 
demi-status, they are demanding the immediate repeal of the 
Cauchy-Schwarz and Tchebycheff inequalities. The day will soon 
arrive when every psychonome on Olemmas will have the titular 
right to replace the sound of Jones, Smith, and Hall with melodious
monikers like Safarevic, Krasnoselskii, Nemyckii, Cebotarev, 
Ljusternik, or Hooke.

Basic Statistics: A Primer for the Biomedical Sciences by Olive Jean
Dunn. New York: John Wiley and Sons, 1964 (second printing,
1966). Pp. xiii + 184.

Like pee puce the biological and medical sciences 
are in a period of rapid growth. This growth has in no small sense 
been increased by the diversification of experimental procedures, 
methods, and concepts. Biological systems are multidimensional, 
and differences between individuals are great, and therefore
quantitative methods in biology, as in psychology and in education,
necessarily vary considerably depending on the environment.
Biologists, too, have faced similar difficulties to those of statisticians
in psychology and education in adapting techniques from other areas:
the adaptation process is not simple, frequently indirect, and
almost always impossible without substantial revision, reformulation,
or extension. Like many psychological systems, the simulation of 
biosystems requires a high level of mathematical competence and 
the assistance of large-scale computers. And as theoretical concepts 
mature in biology, mathematical models and methods become more 
and more effective—just as they do in education and psychology. 
Perhaps an isolated point of divergence between our disciplines and 
the biological sciences is that the latter may be more reasonably 
joined with the methods of the physical sciences with their interest 
in what we do (whereas the psychological “sciences” are primarily 
concerned with how we perform). 

In balance, however, there can be little doubt that much fruitful 
exchange can be made—and already has been made—between the 
methodologies of the biological sciences and the behavioral sciences. 
For that reason, it behooves us from time to time to inquire what is 
going on in the teaching of research and statistical skills in the 
biomedical sciences, and Dunn’s book provides one of those op- 
portunities. 

Basic Statistics was designed to serve as a textbook for a one- 
semester course in statistics for students in the biomedical fields, 
and as a reference book for physicians, nurses, and public health 
workers who are involved in research itself or who have to interpret 
research results.

Apparently, the instructor in the biomedical field faces the same 
basic problem that confronts those who teach in the behavioral 
sciences—an “inadequate” mathematical preparation on the part 
of the bulk of the student body. Every effort has been made by the 
author to keep the mathematical demands in the presentation at a 
high-school algebra level. Essentially, there are no derivations of 
formulae, the presentation is highly verbal; and whenever the use of 
symbols and formulae is unavoidable, they are carefully and pa- 
tiently explained both in words and through simple and clear 
worked examples. 

Naturally enough, the examples that the author uses are
consistently taken from the biomedical areas, but they are not so
unfamiliar as to make any student of the behavioral sciences feel totally out
of his depth with them. Many examples are taken directly from 
medical journals and reports (this makes fascinating reading at 
times—one often wonders what the original authors of these reports 
did with their data) and if one can stomach a seemingly morbid 
preoccupation with bleeding rates, and retain the appropriate
respect for esoterica such as retrolental fibroplasia, the problems can
be illuminating. 

Surprisingly, perhaps, the demands made of the biomedical
student in terms of scope appear from this book to be a little less
extensive than is the usual practice in psychology and education
courses. There are twelve chapters, which together cover the usual
descriptive statistics, and a smattering of the notions of statistical
inference. If there were a greater depth to the content, there would 
be reason for enthusiasm about the book, for it is written in that 
particularly clear, relaxed style that makes for easy reading and
comprehension. The necessary statistical tables are presented in an 
appendix, together with answers to some of the exercises. 

Chapter 1 gives a brief and not-too-useful definition of statistics, 
and a few equivocal objectives in studying statistics. For the be- 
haviorist, these objectives will cause a raised eyebrow or two, al- 
though if one allows for the terminology most of the objectives 
could very well be those of any education or psychology course, or 
at least a start on a parallel list. This first, brief, chapter (it is 
only four pages long) is rather out of character with the rest of 
the book, and is decidedly its low point.

Chapter 2, concerned with the concepts of population and sample, 
more than makes up for the shaky start. In the author's delightful 
style, a distinction is drawn between these notions at a stage in the 
instructional process where this is not often done, and with clearly
beneficial results for the subsequent chapters.

The third chapter spends time (rather too much time, in the
reviewer’s opinion) on graphical display of data. Chapter 4 concerns
itself primarily with the arithmetic mean and the median as
measures of location. It is interesting to note on a contrasting basis, 
that the mode gets no more than a one-line mention as “another 
possible index.” Similarly, the only variation index that is treated 
is the standard deviation (Chapter 5). These very specific selections 
of content by the author were no doubt conscious—the comparison 
with the usual psychological and educational statistics texts, which 
spend considerably more time on these descriptive measures, is a
thought-provoking one. Chapter 5 also includes a rather elegant, 
though simple, discussion of sampling properties of the mean and 
variance, through presenting the notions of statistical bias in a very 
concise fashion. 

Chapter 6 skips through the fundamental properties of the “nor- 
mal curve” (the way the topic is discussed, the author probably 
really meant the normal distribution), and is followed in Chapter 7 
by an excellent overview of the idea of a confidence interval. As in 
the whole book, one wonders what heights the work could have 
reached if there were more meat: as a piece of writing it is beautifully
clear and direct, but is so saltatory and perhaps oversimple,
that it must leave many frustrated biomedics in its wake. But then,
maybe the biomedical mind does not work that way.

Chapter 8 is probably the best balanced chapter in the book: it 
deals with tests of hypotheses on population means. Chapters 9 and 
10 deal, respectively, with data as proportions, and the chi-square
distribution. There is nothing new or particularly fresh here: the
kinds of offering are much the same as those the reader is used to 
in the behavioral sciences. Chapter 11, however, puts an emphasis 
where perhaps the audience of EDUCATIONAL AND PSYCHOLOGICAL 
MEASUREMENT does not (or at least not to the same extent): in one of
the lengthier chapters the author discusses variance estimation, and
concludes with the basics for the F-test. The final chapter—chapter 
twelve—provides an introduction to correlation and regression. 

In sum, there is little in this book that is new, or that would make
it specially attractive to the instructor in education or psychology.
Perhaps its being made available for review is its most valuable
contribution. Comparisons are not always odious, and if this is a
typical biomedical statistics text at the beginner level, educational 
and psychological statisticians would seem to be doing at least a 
comparable job. 

PETER A. TAYLOR

Rutgers University 


Elementary Statistics (Second Edition) by Paul G. Hoel. New 
York: John Wiley and Sons, 1966. Pp. ix + 351. 

The second edition of Hoel's basic statistics text remains very
similar to the first in general orientation and emphasis. It is pri- 
marily a descriptive book for students in applied areas with mini- 
mal mathematical preparation. The approach is largely intuitive, 
but the basic concepts are not slighted because of that. Hoel’s pres- 
entation of classical sampling concepts has a good deal of depth and 
provides the basis for much of his exposition. 

The most noteworthy changes match a number of recent trends in 
elementary texts. Hoel has placed increased emphasis on sampling 
distributions, has expanded his treatment of elementary probability 
and cast much of it in geometrical terms, and has included a some- 
what greater variety of material. Some of this added material is of 
dubious value. Very short sections on multiple and nonlinear regression
have been supplied which almost certainly serve no function
other than to make the student aware of their existence and which
may be confusing as well.

In outline, Hoel begins with a short discussion of the nature of 
statistical methods and then proceeds to descriptive measures of 
sample data. He makes the very valid point that grouping data may 
be neither necessary nor advisable in these days of the ubiquitous 
computer. Hoel then talks about the basic ideas of probability. In 
the interests of simplicity, the development of permutation and 
combination formulae is somewhat clumsy, but the discussion of
Bayes’ theorem is an excellent elementary presentation. The introduction
to sampling distributions which follows is better than average.
The sampling chapter is generally well done and the discussion
of stratified and proportional sampling, two important topics often
ignored in elementary texts, is very pleasing. The chapter on estimation,
which often also gets short shrift in beginning texts, is good.
Hoel's treatment of hypothesis testing begins with consideration of 
Type I and Type II errors and builds upon these central ideas. 
The chapters on correlation and regression are, with the exceptions 
noted above, clearly developed and sound. 

Under the heading, “Special Topics,” Hoel presents chapters on
chi-square, non-parametric tests, analysis of variance (anova) and 
time series and index numbers. The author’s idea is that these topics 
are less central to an introductory course in statistics than those 
mentioned in the paragraph above and should be considered at the 
end if time allows. With the exception of chi-square, this is probably 
a reasonable approach. The material on chi-square is based almost 
completely on intuitive appeal, but is sound as far as it goes. The 
non-parametric chapter is naturally rather limited, but does mention
rank correlation. One-way classification anova is done well, but
the two-way section sluffs over just too many points from model to
assumptions to expectations. The chapter on time series and index
numbers appears to be purely an appeal to business and economics 
sections in applied statistics. The treatment is rather good at the 
level attempted, especially of index numbers, but it certainly does 
not mesh with the rest of the book. 

In terms of general consideration, Hoel’s text must be rated as 
superior. The writing is lucid, the examples are clear, the approach 
is steadily intuitive and does not take cover in mathematics when 
difficult points approach. The problem sets are interesting, useful, 
and plentiful, but the tables are less complete than one could wish. 


Overall, the book should provide a useful text in many applied
courses in elementary statistics.

Iowa State University 


Statistics: An Introduction to Quantitative Economic Research by 
Daniel B. Suits. Chicago: Rand McNally and Co., 1963. Pp.
xix + 260.

It is easy for the writer of an introductory statistics textbook to
overwhelm the student by including far too much material for a one-semester
course. In addition, it is tempting to strive for comprehensiveness
by including many mathematical derivations and proofs
which are not essential to the beginning student's grasp of some of
the basic concepts and the statistical way of looking at things. However,
the author of this little volume on introductory statistics for
students of economics has attempted to be brief, clear, and pen 
and without being cavalier or pedantic. Within the ten chapters of
this textbook is included much of the foundation of statistics for social
science in general. The author writes well and includes numerous
illustrations and examples, both within the chapters themselves
as well as in the several questions and problems following
each chapter.

The topics surveyed are standard for elementary economics sta- 
tistics, viz. frequency distributions, central tendency, dispersion, 
sampling, point estimation, confidence intervals, stochastic func- 
tions, significance tests, linear and nonlinear regression, time series, 
and index numbers. The majority of these topics are, of course, relevant
to many of the other social sciences. The introductory material
presented in the first four chapters of the book leads into the
central topic of stochastic functions in Chapter 5, and the last five 
chapters deal with special applications. In summary, the book is 
well-written, representing more of a verbal than a mathematical 
exposition. The focus is on clear explanation and a practical ap- 
proach through numerous examples. Consequently, it should prove 
to be more popular with students than most introductory statistics 
books.

Lewis R. AIKEN, JR. 
Guilford College


A Programmed Introduction To Statistics by Freeman F. Elzey.
Belmont, California: Wadsworth Publishing Co., Inc., 1966. 
(Paper) Pp. viii + 376. 

A Programmed Introduction to Statistics is yet another attempt 
to sugar the statistical pill that many students claim to find unpal- 
atable. The author has contrived—and not without a modicum of 
success—to encapsulate his topics in an easily swallowed tablet, the 
efficacy of which will depend to a large extent on the statistical ill- 
health of the patient. 

The book is written for students “. . . being introduced for the
first time to statistical techniques . . . the focus is primarily upon
the student who is unfamiliar either with the basic concepts of statistical
techniques or with the mathematics needed to apply these
techniques.” This purpose, many instructors of statistics will find
appealing. And, indeed, their interest would not be misplaced if they
were concerned with providing a supplementary diet for the statistically
infirm. Healthier appetites will not be satisfied.

The text is organized into twenty-four topics or sets. The number
of frames per set ranges from a low of thirty-one to a high of sixty-two
with a mode in the fifties. Sets cover the usual descriptive
statistics, basic statistical inference, chi square, and simple analysis
of variance. The order of presentation is logically sound and enables
a certain smooth flow to be maintained throughout the text. Of
course, the content is not uniformly difficult; yet the difficulty is not
reflected in the number of pages devoted to a topic. Each set is begun
with a statement of its objectives, which, although possibly
alarming on first confrontation, should ultimately clarify for the student
the purpose and meaning of each statistic. At the end of each
set, the author has provided a number of exercises to be completed 
in a non-programmed format. (Answers are provided in the appen- 
dix). These exercises are a valuable addition and an attractive con- 
cession to the more traditionally oriented instructor. 

The frames are, generally, well written and attest to thorough 
field testing. There are a few ineptly-phrased frames and there are a 
few mathematically or grammatically questionable expressions. There 
are also a few dogmatic assertions (“Statisticians generally agree
to use P = .0500 as the cut-off point in accepting or rejecting
the [null] hypothesis”). The order of presentation also varies
slightly from the traditional in that correlation is postponed to the 
later parts of the book, while analysis of variance appears relatively 
early. Anyone familiar with current elementary statistics texts will 
find nothing extraordinary herein. A unique feature of the book, 
however, is its collection of formulae and symbols in appendices. A 
minimum set of statistical tables is also provided. 

The question then remains: To what extent has the author attained
his stated purpose of providing a nonmathematical introduction
to statistics? Clearly it is an extravagant desire to present
statistics without any reference to mathematics. But the author has
been highly successful in minimizing the need for algebraic competence.
This leaves a flavor of the “do-as-I-say” variety: it possibly
results in a student's being tied to environmental cues of this particular
text. It may be that its intended audience will not wish to
use statistical methods but will want only to be able to read quanti- 
tative data with insight. To this extent, the book will probably suc- 
ceed. 

The book should find a substantial market among students who 
are seeking an alternative presentation to that offered them in their 
current courses. Or, it could provide instructors with a formalized 
supplement to their classroom materials. For these special audi- 
ences the book will be attractive and is probably deserving of some 
attention. 

PETER A. TAYLOR 
SUSAN F. FORD
Rutgers University 


Defining Educational Objectives: A Report of the Regional Commission
on Educational Coordination and the Learning Research
and Development Center by C. M. Lindvall, Editor. Pittsburgh: 
University of Pittsburgh Press, 1964. Pp. xii + 83. $1.50. 

In view of the recent emphasis upon the evaluation of federally-supported
educational programs that are considered to be exemplary
and innovative, the contents of this small booklet are indeed valuable
to the person in curriculum who is trying to define and to describe
in operational terms specific educational objectives, the formation
of which must necessarily precede systematic efforts at
evaluation. Based upon a meeting of the Regional Commission on 
Educational Coordination held April 1963 on the University of Pitts- 
burgh campus, the papers in this volume afford the individual con- 
cerned with problems of curriculum development and with the eval- 
uation of the effectiveness of instructional procedures associated 
with the implementation of the curriculum an opportunity to keep 
abreast of the most recent thinking. 

Following an introductory chapter prepared by the editor are five 
chapters, each of which represents a contribution by the major par- 
ticipants at the conference. Lindvall, Nardozza, and Felton discuss 
at length the importance of specific objectives in curriculum devel- 
opment, in relation to the Curriculum Continuity Demonstration 
(CCD) project of the Pittsburgh Public Schools and the University 
of Pittsburgh and furnish illustrative examples concerning how spe- 
cific objectives for a unit in seventh grade social studies can be 
stated. In the next chapter Krathwohl develops at length a taxon- 
omy of educational objectives in both the cognitive and affective 
domains—a reiteration of the work which he and Benjamin Bloom 
and their distinguished associates have developed since the early 
1950's. Specific uses of the taxonomy in curriculum development are 
cited and illustrated. 

The next two chapters place particular emphasis upon the rela- 
tionship of instructional objectives to learning. In the fourth chap- 


fected through their being made explicit as instructional objectives,
through conscious attention to their being taught, and through the 
evaluation of their occurrence. It is the authors’ thesis that ob- 
jectives should be measured in relation to the content of achieve- 
ment attained, not in terms of the relative level of performance of 
one student with another. 

In the closing chapter Tyler comments at length upon four major 
points that are often raised in any consideration on the subject of 
educational objectives: (1) his concern about the importance of a 
clear definition of objectives when teachers themselves have done 
excellent work without having a lucid statement about their goals; 
(2) the need for a clear definition of an educational objective in 
terms of what it is that a student should be expected to do and in
terms of how he should be able to think or feel in the attainment 
of this objective; (3) the determination of the degree of specificity 
to which the educational objectives should be stated; and (4) the 
considerations that need to be taken into account in the selection 
of educational objectives. 

In short, this book affords both the teacher and the professional 
individual involved in curriculum development and evaluation a 
useful set of guide lines that may be expected to result in substantial 
improvements in educational programs. Although there are a few
places where additional examples of realistic applications could be 
made advantageously, there appears to be for the most part a highly 
favorable balance between theory and practice. The educational 
profession can be grateful for the systematic and integrated efforts 
and for the heuristic directions that this small volume contains. 

WILLIAM B. MICHAEL
University of California 
Santa Barbara 


Experimental and Quasi-Experimental Designs for Research by 
Donald T. Campbell and Julian C. Stanley. Chicago: Rand McNally
& Company, 1963. Pp. ix + 84.

Teachers of experimental design in psychology and education as 
well as their graduate students can be happy that the editorial staff 
of Rand McNally and Company have seen fit to publish as a
separate volume the fifth chapter by Donald T. Campbell and Julian 
C. Stanley from the Handbook of Research on Teaching edited by 
N. L. Gage and published in 1963 under the longer title “Experi- 
mental and Quasi-Experimental Designs for Research on Teach- 
ing.” Subsequent to nearly six introductory pages concerned with 
experimentation in education and the behavioral sciences, the writ- 
ers discuss at length three pre-experimental designs, three true ex- 
perimental designs, and ten quasi-experimental designs. Each one 
of these designs is evaluated in terms of eight criteria relevant to
what they consider to be contaminating influences upon the internal
validity of an experiment and in terms of four factors that
jeopardize the external validity or representativeness of an experi- 
ment. Accompanying each paradigm which is included to represent 
the basic structure of the operations involved in each of the sixteen 
designs described is a helpful discussion that includes numerous il- 
lustrations of typical sources of error and contamination encoun- 
tered in experimentation involving that particular design. 

Three helpful tables are incorporated within the text to show at a 
glance sources of both internal and external invalidity for each of the
designs—tables in which plus marks are placed to indicate that the 
source of invalidity is absent or probably absent in the design, in 
which negative signs reveal the probable presence of a source of in- 
validity, and in which a question mark reveals a possible source of 
concern. Omission of any mark reveals that the factor is probably 
not relevant. Although such a scheme of indicating sources of in- 
validity is admittedly incomplete and oversimplified, it nevertheless 
affords a quick and helpful overview to the student and researcher 
who wishes either to evaluate the adequacy of the design of an ex- 
periment which he has read or to prepare a design of his own as free 
of contaminating sources of error as possible. This checklist ap- 
proach is also very helpful when one wishes to review the contents 
of this very helpful booklet after he may not have been thinking 
about its material for a number of months.
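
The marking scheme just described can be sketched as a small lookup structure. What follows is only an illustration under stated assumptions: the design names, factor names, and marks are hypothetical placeholders, not entries copied from the booklet's tables, and the helper names `mark_for` and `flagged` are invented for this sketch.

```python
from enum import Enum


class Mark(Enum):
    ABSENT = "+"        # source of invalidity absent or probably absent
    PRESENT = "-"       # source of invalidity probably present
    POSSIBLE = "?"      # possible source of concern
    NOT_RELEVANT = ""   # no mark: factor probably not relevant


# Hypothetical rows for illustration only; the design names, factor names,
# and marks are placeholders, not entries copied from the booklet's tables.
CHECKLIST = {
    "design A": {
        "factor 1": Mark.ABSENT,
        "factor 2": Mark.PRESENT,
        "factor 3": Mark.POSSIBLE,
    },
    "design B": {
        "factor 1": Mark.ABSENT,
        "factor 2": Mark.ABSENT,
    },
}


def mark_for(design, factor):
    """Look up a mark, treating an omitted entry as 'probably not relevant'."""
    return CHECKLIST[design].get(factor, Mark.NOT_RELEVANT)


def flagged(design):
    """List the factors marked '-' or '?' for a design, in alphabetical order."""
    row = CHECKLIST[design]
    return sorted(f for f, m in row.items() if m in (Mark.PRESENT, Mark.POSSIBLE))


print(flagged("design A"))
```

Used in this way, a reader evaluating a published experiment would look up the row for its design and attend first to the factors that `flagged` returns.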

As a teaching aid this short volume should be required reading in 
any course in experimental design for students in psychology, education,
or sociology. Although it cannot substitute for original
thinking about experimentation, 


many glaring errors. The closing 
ex post facto designs is particularly helpful to the person involved in 
field studies and in loosely-controlled research investigation. The 
bibliography of approximately 150 references in addition to the set 
of twelve supplementary references constitutes a helpful if not in- 
dispensable guide to further study. 


WILLIAM B. MICHAEL
University of California 
Santa Barbara


Understanding Educational Research (Rev. ed.) by Deobold B.
Van Dalen and William J. Meyer. New York: McGraw-Hill 
Book Company, Inc., 1966. Pp. x + 525. 

Van Dalen's revised edition is considerably improved over the 
initial volume. The treatment of the interrelationship between re- 
search and society and of the assumptions underlying the utilization 
of the scientific method is clear and authoritative. In the first four 
chapters a sturdy foundation and sound justification are developed
for the use of the scientific method in solving problems and acquir- 
ing information in the field of human behavior. A consistent plea is 
made throughout the book for researchers to relate their experi- 
ments to theoretical considerations. This plea, if heeded, would do 
much to systematize much of the disjointed and isolated efforts 
presently evident in the field.

Two chapters are devoted exclusively to helping the reader utilize 
the resources of the library. One chapter on printed sources is en- 
cyclopedic, citing almost 100 different dictionaries, almanacs, direc- 
tories, indexes, etc., but contains insufficient emphasis on basic 
sources as distinguished from peripheral materials, e.g., the Hand- 
book of Research on Teaching is treated in one sentence. The very 
elementary chapter on library skills appears to be unnecessary for 
several reasons. A book which attempts the formidable task of 
bringing the reader from a state of complete naiveté concerning 
research to an understanding of multifactor ANOVA design, can ill 
afford the space devoted to such an elementary subject. Secondly, it 
seems inappropriate for this presentation to be included for the 
audience for which it is intended, namely, “. . . for mature upper- 
classmen, graduates pursuing their master’s degree and doctoral 
candidates who have had only a limited exposure to the scientific 
method.” If such persons have not acquired most of the necessary
information and skills involved in using a library, a reference to a 
concise treatment of this topic would serve the remedial task with- 
out offending the more scholarly student or the student seeking jus- 
tification for the “mickey mouse” and/or redundancy hypotheses in 
education courses. Thirdly, learning to use the library is somewhat
like that venerable example, learning to swim. The most efficient 
learning results when one is in a state of need and seeks the assis- 
tance of a librarian who can prevent or stop floundering by a re- 
directing of activity based on specific problems. A few visits to the 
library and/or a class session with the librarian can produce more 
learning than a considerable amount of reading on the subject. The 
idiosyncratic nature of university libraries lends support to the latter
alternative.

The chapters on historical and descriptive research are largely 
unchanged from the first edition. They are clear and concise 
treatments of general concepts. 


The chapter on experimental research has been completely rewritten
for this edition and, consequently, been vastly improved in
content. The author draws heavily on the now-classic Campbell- 
Stanley chapter on experimental and quasi-experimental designs for 
research on teaching. Eight designs are selected and discussed from 
the sixteen presented by Campbell and Stanley. Each design pre- 
sented is discussed and criticized in terms of threats to internal and 
external validity. Evaluating designs in terms of the criteria suggested
by Campbell and Stanley would surely result in a significant
increase in the quality of educational research. 

A perplexing problem peculiar to books in this general area of 
educational research is in resolving the question of how to treat 
statistics. Should the subject be avoided? Should there be an as- 
sumption of some competence on the part of the reader? Should 
statistics be isolated, or woven into the fabric of the entire book? 
Should both descriptive and inferential statistics be treated? This 
problem has been resolved by various authors in every way im- 
plied by the questions above. Van Dalen resolved the issue in a 
unique way: Statistics are not only isolated in two separate chap- 
ters, but these chapters are written by another author, William J. 
Meyer. A commendable job of presenting so much information in so
little space (83 pages) is evidenced in these chapters. The material 
has been extensively revised but, unfortunately, is not well inte- 
grated and coordinated with the remainder of the book, especially 
with the treatment of experimental design. An additional organizational
inconvenience is the treatment of sampling designs in the chapter,
“Tools for Research,” rather than along with the internal-external
validity framework of the chapter on experimental research.

Meyer presents a logical and tightly organized development beginning
with frequency distributions and ending with three-way
analysis of variance and repeated measures designs. This is considerable
ground to cover, and he does it well under the circumstances.
However, there is little justification for including the soon-to-be-extinct
grouped-data approach in computing measures of central
tendency, variability, and correlation. All texts, as does Meyer,
abandon this “short cut” when they progress to advanced topics 
such as the analysis of variance. In addition, computer availability 
has removed the basic raison d'être for groupi 


intervals and 2.025 — 2515. No indication is given that the number of 


SX needed to encompass the confidence interval increases as N 
decreases. 
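
The point about N can be illustrated with the usual small-sample interval for a mean. This is a sketch only, on the assumption that the passage concerns a 95 per cent interval built from the standard error of the mean; the notation is the conventional one and is not taken from the book under review. The critical values are those of standard two-tailed t tables.

```latex
\[
  \bar{X} \pm t_{.025,\,N-1}\, s_{\bar{X}},
  \qquad s_{\bar{X}} = \frac{s}{\sqrt{N}}
\]
\[
  \begin{array}{c|cccc}
    N - 1            & 5     & 10    & 30    & \infty \\ \hline
    t_{.025,\,N-1}   & 2.571 & 2.228 & 2.042 & 1.960
  \end{array}
\]
```

As N decreases, the multiplier of the standard error rises above the large-sample value of 1.96, so more standard errors are needed to span the same level of confidence.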

A statement by the author in the introductory paragraphs of the
chapters on statistics brings this reviewer back to one of two major 
substantive criticisms of the book. He states: 

The materials developed in these chapters illustrate statistical 
methods for evaluating data derived from the various research 
strategies and methods described in Chapters 10 and 11. The
reader will do well to relate the statistical procedures in these 
chapters to the materials in the earlier chapters. 

This admonition to the reader is not a good substitute for a co- 
hesive and integrated development of these interrelated sets of 
concepts. 

The second basic criticism results from the inordinately broad 
scope of the book. To master the statistics included alone would 
require two semesters. 

The reviewer is well aware that alternate presentations would not 
be exempt from the same and other criticisms; the negative comments
contained in this review should not be allowed to detract disproportionately
from the general high quality of this book. It compares
quite favorably with its competitors.

PERCY D. PECKHAM
KENNETH D. HOPKINS
Laboratory of Educational Research 
University of Colorado 


Unobtrusive Measures: Nonreactive Research in the Social Sciences
by Eugene J. Webb, Donald T. Campbell, Richard D. Schwartz,
and Lee Sechrest. Chicago: Rand McNally & Co., 1966. Pp.
v + 225.

The authors of this book feel that a set of criteria is needed by
which research techniques can be appraised morally. Of specific 
concern to the authors are the privacy of the individual, his free- 
dom from manipulation, the protection of the aura of trust on 
which the society depends, and the good reputation of social sci- 
ence. 

Chapter 1, “Approximations to Knowledge,” criticizes the fact
that most social science research is based upon interviews and questionnaires
at the expense of what the authors call multiple operationism,
a collection of methods. Threats to the valid interpretation
of a difference are divided into three groups: error that may be
traced to those being studied; error that comes from the investigator;
and error associated with sampling imperfections.

Chapter 2, “Physical Traces: Erosion and Accretion,” examines
research methods involving physical traces from past behavior.
Physical evidence is divided into two broad classes: erosion measures,
where the degree of selective wear yields the measure; and
accretion measures, where the research evidence is some deposit of
materials.

Chapters 3 and 4 deal with the use of archives by social scientists.
It is suggested that such a source of data may serve to avoid the 
problems of invasion of privacy by allowing the researcher to ob- 
tain valuable information without identifying or manipulating the 
individuals involved. 

Chapter 5, “Simple Observation,” suggests that the unobserved
observer may appropriately function in certain situations but not in 
others. The main advantage of the use of this technique is that data 
are obtained first-hand. 

Chapter 6, “Contrived Observation: Hidden Hardware and Control,”
discusses the use of electronic devices to acquire data.

The book admittedly deals with elementary social science re- 
search measures. The sophisticated researcher may profit from a 
reminder as to techniques he has previously studied; the beginning 
researcher may be introduced to research techniques for the first 
time. 

The real value of this book may well be as a complementary volume
to a text in introductory research classes in the social sciences.
In this way students who plan to adopt a behavioral science ori- 
entation and students who plan to adopt a humanities orientation 
may see how the authors’ ideas may be employed in secondary 
school teaching. There is a great need for a book which considers 
empirical approaches adaptable to both the humanities and the 
behavioral sciences. This book may be most useful in this respect.

DALE L. BRUBAKER
University of California 
Santa Barbara, California


the present study by Kropp, Stoker, and Bashaw not only represents
a major accomplishment when judged against the relative
paucity of previous work, but also is the single most comprehensive 
attempt to date to deal with the Taxonomy on an empirical basis.
The high quality of the work, the fundamental issues that are 
identified and discussed, if not resolved, and the sophistication of 
the analytic methods employed rank it as the landmark of research 
in this area. 

The research addressed itself to three major questions. These 
were: 

1. Can empirical evidence be found to support or refute the imputed
hierarchical structure of the Taxonomy?

2. Can empirical evidence be found to support or refute the im- 
puted generality of the several cognitive processes? 

3. Can each level of the structure be explained by more elemental
cognitive aptitudes, and, if so, do the combinations or numbers of 
them change systematically from one major level to the next? 

To investigate these questions, the researchers first constructed
four taxonomy type tests. Each test consisted of two parts: one for 
Knowledge, Comprehension, Application and Analysis, and one for 
Synthesis and Evaluation. The tests were of the reading comprehension
type in which a short reading passage preceded, in the first
part, multiple choice items, and, in the second part, free response 
items. Each subtest in the first part contained twenty four-choice 
items while the Synthesis section contained five items and the 
Evaluation section ten items. The scoring procedure permitted a 
maximum of twenty points per taxonomic level. The reading pas- 
sage for each test was selected on the basis of high probable in- 
terest, ease of comprehension, and unfamiliarity to students. The 
last two specifications were deemed crucial because of the necessity 
that subject matter mastery of students be as nearly equal as pos- 
sible so that score variability would reflect differential mastery of 
the cognitive processes. Of the four passages that were selected, two 
dealt with social science content, the Lisbon Earthquake and 
Stages of Economic Growth, and two dealt with science content, 
Atomic Structure and Glaciers. Finally, the tests were so construc- 
ted as to be used with ninth through twelfth grade students. Ad- 
ministration time for each part of each test was fifty to sixty min- 
utes. Standard procedures for pretesting the items in each test were 
followed. Revision of the tests was based on item analysis data, ob- 
servations, and reactions of students, and some additional consider- 
ations which had to be developed especially for the study. 

In addition to the four taxonomy tests, thirty-seven tests from the 
Kit of Reference Tests for Cognitive Factors were selected to ob- 
tain evidence on cognitive aptitudes. About 1600 students were ad- 
ministered the taxonomy tests at each of the four grade levels and 
about 275 students at each grade level were administered the cognitive
reference tests. The multiple choice items on the taxonomy
tests were electronically scored, while the free response items were 
scored by an elaborate set of procedures for which substantial re- 
liabilities were demonstrated. Finally, the cognitive reference tests 
were scored by a separate scoring staff. Quality control checks were 
run on all scoring procedures and full reliability data are presented 
for all instruments. 

Six types of analyses were carried out to answer the main ques- 
tions posed for study. First, mean scores were obtained for each 
taxonomic level for each grade to obtain a test of the
inverse relationship between mean performance and taxonomic
level. Second, the hypothesis of simplicial structure underlying the
taxonomy was studied through techniques devised by Kaiser for
scaling a simplex. These analyses were carried out for each test
form for each grade and for each form over all grades. Third, the
issue of the transcendence of taxonomic processes over content levels
was tested by means of factor analysis and Guttman’s circumplex
analysis. Four 24 × 24 correlation matrices (four tests × six process
levels) were factor analyzed—one for each grade level. The unrotated
matrices were interpreted because of the investigators’ feeling
that available analytic rotational procedures were not applicable
with the data. Fourth, matrices consisting of single-process
level intercorrelations over test forms (and hence content levels)
were examined for circumplex structure. Fifth, factor analyses of
the 37 × 37 correlation matrices of cognitive reference tests for each
grade level were carried out. Ten factors were identified and or- 
thogonal cognitive aptitude scores were determined for each stu- 
dent. Finally, factor scores were used in multiple regression analyses 
to “explain” performance at each taxonomic level for each test at 
each grade level. 


full report for a detailed presentation. Fortunately, the authors 


ository style that makes the report 
the judicious use of summarizing
reversal of the two top levels occurs in the science area and not in
the social science area is an interesting problem deserving further 
research. 

The test of the hypothesis of transcendence of taxonomic proc- 
esses over content areas did not yield clear-cut results. The most 
apparent reason for this failure to achieve unequivocal results can 
be attributed to the lack of an appropriate analytic rotational pro- 
cedure and the consequent use of unrotated factor matrices. The 
appearance of some factors that were mixtures of content and 
process suggests that scores on the tests are determined by complex 
interactions of the two. On the other hand, some support for the 
transcendence of process was demonstrated through the circumplex 
analyses, although no perfect circumplexes were found. 

The investigation of the aptitude structure of the taxonomic 
levels was also subject to methodological difficulties. Only eight 
common factors could be extracted instead of the hypothesized six- 
teen and the majority of factors were mixtures of content and proc- 
ess. Certain patterns did, however, emerge which indicated syste- 
matic changes in factor structure over taxonomic process levels and 
grades. The authors indicated that these results must be considered
merely suggestive because of the need for more refined taxonomic
tests and the development of better statistical procedures for treat- 
ing the data. 

In addition to the quantitative data gathered and analyzed, the 
authors dealt with a number of issues relating to the taxonomy. 
The first of these arose in the test construction phase and involves 
the classification of items into taxonomic levels. The investigators 
report difficulties among the project staff in classifying items as to 
intended taxonomic categories, especially at the level of synthesis 
and evaluation. Some of these difficulties seemed to stem from a 
decision made at the outset that all items were to be of the multiple 
choice type. When it was decided to abandon this format and use 
free response items at the levels of synthesis and evaluation, some 
of the problems of classification were dissolved. Classification prob- 
lems did remain, however. Other investigators have commented on 
these problems both at the level of classification of items into the 
sub-categories of the taxonomy processes as well as into the six ma- 
jor categories. It would seem that further work is needed in clarify- 
ing the various taxonomic categories. One procedure which this writer 
has found useful in classifying items into the taxonomic categories 
is to have a precise description of the intended target population 
on which the items are to be used, including some statements as to 
various learning experiences they have undergone. This is often of 
crucial importance in classifying items in that it makes it possible 
to determine the amount of novel material in test items, a sine qua 
non for distinguishing between various taxonomic process levels. 
A side issue arising from the classification of test items is the proportion
of items in each taxonomic level. In the present study, the
authors decided to have twenty items at each of the first four levels 
and the number of items at the levels of synthesis and evaluation, 
while smaller, did yield a maximum score of twenty points. While 
this made it possible to compare mean scores across taxonomic 
levels directly, the authors probably would not claim for one min- 
ute that this reflects the amount of weight they would accord each 
process level in terms of educational value. In general, one finds 
that the classification of items from existing tests into taxonomic 
levels usually reveals a highly disproportionate picture. Stanley and 
Bolton have reported a project in which about fifty per cent of the 
items from a standardized test circa 1954 were classified at the 
knowledge level with decreasing proportions of items at each sub- 
sequent level of the taxonomy. This reviewer has found that 
approximately sixty-five to seventy-five per cent of the items 
from teacher constructed end of course examinations fall into the 
knowledge level. This is an alarmingly high percentage to this 
reviewer and, sometimes, to the teachers themselves. Frequently,
such a finding is used as the basis for revision of testing procedures.

The classification of test items into taxonomic categories is often 
a highly revealing endeavor. With some expert help, it can provide 
crucial insights into the level of goals that an educational enterprise 
has set for itself whether it be a single classroom or a national 
curriculum. Most educational enterprises claim that they seek to do 
something more than merely transmit information. The Taxonomy 
of Educational Objectives gives some rather clear indications of 
what this “something more” is. In any work with th


Thus, the failure to answer such knowl-
e ascribed to failure to possess the in- 
gory of the Taxonomy considerably. It is apparent that a good deal
of additional thinking and study will be needed on this issue. 

Another set of issues raised by the study involves the relationship
between complexity and difficulty. The continuum underlying the 
Taxonomy of Educational Objectives is complexity. That is, the 
developers of the taxonomy held the notion that objectives could be 
ordered with respect to increasing complexity. To what extent the 
concept of complexity is similar to difficulty is hard to say. In the 
present study, the examination of the imputed hierarchal structure 
of the taxonomy was accomplished through examination of mean 
level of performance on each taxonomic level. The means, of course, 
are strictly a function of difficulty as the authors point out. To what 
extent this is an adequate measure of complexity is a most inter- 
esting problem. To what extent, if any, these two concepts can even 
be separated is a thorny theoretical as well as methodological prob- 
lem. Furthermore, the issue has distinct ethical overtones. As the 
authors point out, 

“possession of the item analysis data provided the staff with in- 
formation which bore directly, through a chain of fixed activi- 
ties, on the ultimate outcome of testing the hypothesis about the 
structure of the taxonomy. How the analyses were used could 
finally shape the result of testing the hypothesis. An example 
will clarify this matter. It is claimed that the taxonomy is hier- 
archal; consequently the investigators arranged several tests 
of this hypothesis, one of which was that mean difficulty on sub- 
tests would increase as the level of the subtest increased in the 
structure. Thus, item analysis data provided the opportunity to 
select items which would determine whether the hypothesis 
would eventually be accepted or rejected.” 

The investigators managed to extricate themselves from this prob- 
lem by deciding to include an item in the test if it correlated posi- 
tively with its subtest score and was free from technical defici- 
encies. Thus, evidence of difficulty was deliberately excluded from 
the item selection process. This was a decision which avoided ethical 
hang-ups but which will require additional thinking and the de- 
velopment of new criteria if the construction of taxonomy type tests 
is to become a serious, major enterprise.

One more issue which also is at the heart of the Taxonomy involves
the differentiation between intended and actual processes.
The Taxonomy of Educational Objectives sets forth a description of 
cognitive processes while taxonomy type tests are designed to evoke 
these processes in the solution of problems and the answering of 
questions. To what extent taxonomy-type tests do, in fact, evoke 
the intended processes has yet to be demonstrated. It is rather ap- 
parent to the authors that studies of process will have to accompany 
the development of taxonomy-type tests. Again, the need for fur- 
ther research is quite clear. 
The above discussion of problems and issues related to the present
study is not intended as criticism of the work of Kropp, Stoker,
Bashaw and their co-workers. Nothing could be further from the 
truth. The issues were raised by the investigators themselves and 
represent a fascinating complex of theoretical and methodological 
problems. The research itself is the most thoughtful and careful 
work to date on the Taxonomy, and the authors are to be congratu- 
lated on a first rate job. The limitations in the study stem solely 
from a lack of adequate procedures for dealing with the phenomena 
under study. Perhaps one of the most interesting sets of outcomes
of the present study is not the quantitative findings but rather the 
issues and problems raised. To date, no one has given the amount 
of systematic attention to the use of the Taxonomy that the present 
investigators have. It is extremely valuable to have the results of 
their experience so well documented. It should be a fruitful source 
for further study for some time to come. 

A final acknowledgement should be made to the Cooperative Re- 
search Program of the U. S. Office of Education which supported the 
present study. At a time when federal support for education is being
spread on an ever widening basis, it is heartening to see the results
of this particular support program which has given such impetus
to educational research. It is hoped that the support for such
basic studies will continue on an ever increasing basis. 

RICHARD WOLF
University of Southern California 


How to Prepare a Research Proposal by David R. Krathwohl. 
New York: Syracuse University Press, 1966. Pp. 50.
This small booklet contains a wealth of practical suggestions for 
those seeking funds for behavioral science research. Such researchers 
are well aware of the existence of several steps between the con-
ception of a creative idea and the subsequent miraculous appearance 
of the resources needed to change the idea into a working project. 
One of the less romantic steps in this process, the preparation of 
the proposal to make the resources appear, can be a serious hurdle 
to the implementation or trial of an important idea. The material 
that Krathwohl presents should certainly reduce the size of the
hurdle.

The proposal outline followed in this material is that used
for the various programs of the U. S. Office of Education. This
format is typical of that used by most government agencies. Proposals
prepared for philanthropic foundations can be simpler than
is implied by this outline. Such foundations are interested in a state- 
ment of the problem, the objectives, and a brief overview of the 
proposed procedure, all of which are basic parts of the more detailed
government agency proposals.

Most of the explanatory material is somewhat slanted toward 
the experimental study, a reflection of the bias of forms and out- 
lines which various agencies require. The author includes in an ap- 
pendix adequate aids to the proposer of other kinds of studies, with 
modified outlines and particular points to note for survey, predic- 
tive, methodological, equipment development, philosophical, histori- 
cal, and longitudinal studies. 

The material is well written for either the experienced proposal- 
writer or the beginner. It begins with a table of contents which is 
organized in such a way that it can be used as a checklist to ana- 
lyze the proposal's strength and weakness. This detailed table of 
contents is ten pages long which indicates the amount of detail in- 
cluded. The first section of the explanatory material is concerned 
with general questions about the effectiveness of the proposal. While 
some of the questions seem to be those which good sense would sug- 
gest to the proposal-writer in any case, some of them might very 
well be neglected in the confusion of the moment. In each case where 
the question appears to be rather obvious the author includes enough 
realistic explanatory material so that one may see the importance of
including the question. 

One such question, “Is the proposal a resubmission?” falls into this
category of seeming triviality until one reads the explanatory notes. 
It seems logical that it would be necessary to do some revision on a 
proposal which had been rejected. In his discussion of the specific 
things to do to the proposal the author mentions the fine point of 
communicating with the rejecting agency. Not all agencies will re- 
veal the panel’s evaluation of strengths and weaknesses, but many 
grantors are willing to do so. In addition to instructing the re- 
searcher in how to make contacts, Krathwohl advises him on the 
proper evaluation and use of the information he receives. One such 
piece of helpful advice is that the proposal-writer make sure he 
understands the role of the staff in the funding decision. Does the 
staff make recommendations to the panel or is the decision entirely 
the responsibility of the panel? With this role clear, comments 
from staff members can be put to better use.

In all cases the reader is drawn toward a better evaluation of his
proposal by having his attention focused on the methods of evaluation
used by the funding agency. Perhaps this constant reminding
is the most valuable contribution this booklet will make to the art
of proposal-writing. When one is constantly thinking of those who 
will read his proposal and when he has before him a list of the spe- 
cific things these readers will be looking for, he is likely to do a 
much better job. This practical strength is combined with another 
which is a special bonus for the new proposal-writer. The general 
tone of the material lends constant encouragement to the reader.
The interesting treatment and sensible style should encourage any 
researcher to believe that he too can write a successful research
proposal.
HARLEEN W. McADA
University of California 
Santa Barbara 


Structural Models: An Introduction to the Theory of Directed 
Graphs by Frank Harary, Robert Z. Norman, and Dorwin Cart- 
wright. New York: John Wiley and Sons, Inc. Pp. 415. $9.95. 

Directed graphs (digraphs) represent irreflexive relations, and 
they are composed of abstract points and lines. Each line joins 
exactly two points (which are its end-points) and is directed from 
one of the points to the other and thus it indicates their order; a 
point may be on no line, on one line, or on many lines. The pres- 
ence in a digraph of a pair of points and a directed line joining them 
indicates that the ordered pair of points is in the relation represented 
by the digraph. Since digraphs are composed of points and lines,
they readily can be pictorially displayed, unless they are very large 
or complex. 
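
The definition above can be made concrete with a small sketch. It is an illustration only and is not taken from the book under review: the helper names `make_digraph` and `distance` are hypothetical, the four-point example is invented, and breadth-first search is simply one standard way of finding the length of a shortest directed path between two points.

```python
from collections import deque


def make_digraph(points, lines):
    """Build a digraph as {point: set of points it is directed to}.

    `lines` is an iterable of ordered pairs (u, v), each a line directed
    from u to v. Both end-points are assumed to be in `points`.
    """
    adjacency = {p: set() for p in points}
    for u, v in lines:
        if u == v:
            # Digraphs represent irreflexive relations: no point is joined to itself.
            raise ValueError("a line must join two distinct points")
        adjacency[u].add(v)
    return adjacency


def distance(adjacency, start, goal):
    """Length of a shortest directed path from start to goal, or None if none exists."""
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        point, steps = queue.popleft()
        if point == goal:
            return steps
        for nxt in sorted(adjacency[point]):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, steps + 1))
    return None


# Hypothetical example: a is directed to b, b is directed to c, d is on no line.
g = make_digraph("abcd", [("a", "b"), ("b", "c")])
```

Because the lines are directed, `distance(g, "a", "c")` and `distance(g, "c", "a")` need not agree; here the first path exists and the second does not.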

The empirical relations of a science (both observed and the- 
oretical) are representable by digraphs, and in the representation 
the digraph serves as an abstract structural model of the empirical 
relation. The abstract structure can be made real by means of a 
labeled drawing of the digraph. Visual inspection of this drawing
should be a significant aid in the comprehension of the empirical
relation.


(2) its points could represent ideas and the lines could represent
the relationship of consonance (or dissonance) between ideas,
mat for facilitating calculation of quantitative characteristics of
the relation and the elements being related. Examples are the dis- 
tance between elements, the centrality of an element, the status of 
an element, the balance of a relation, and the consistency (degree 
of transitivity) of a relation (which approximates an order relation). 

In the last two chapters of the book, digraph theory is extended to 
a more general and quite useful type of mathematical entity, viz. 
the net. This permits the relation to have sign (+, −, ±), to have
degree, and to be reflexive. 

This book is not required reading for the typical measurer of in- 
dividual differences. It does, however, provide a pleasant diversion 
(it is well written) for a person who enjoys finite mathematics. 
Moreover, if one is imaginative during his reading (the reviewer was 
not till he neared the end of the book), it may lead to valuable new 
ideas on how to measure interesting and complex psychological 
characteristics. 

RAPHAEL HANSON
California State College, Long Beach 


Personality: An Objective Approach by Irwin G. Sarason. New 
York: John Wiley & Sons, Inc. Pp. xvi + 670. 

The legacy of Freud nurtured by his followers led the psychology 
of personality to decades of substantive concern with those organized
internal structures and processes which hypothetically defined
the subject matter, personality. Within this tradition empirical
research in personality was guided by and organized around the
major theoretical positions, and the standard introductory course 
offered a formal survey of psychoanalytic, behavioral, phenomenological,
and field theories of personality. There is, however, growing
evidence of a movement away from substantive concern with “Theories
of Personality” in favor of a survey of empirical knowledge of
personality as one field of psychological research. These develop- 
ments parallel earlier ones in the field of learning where systematic 
presentations of “Theories of Learning” have been replaced by 
data-based surveys of “learning theory.”

The movement from formal and relatively provincial analysis of 
theory to integrated surveys of empirical evidence in the light of 
theoretical issues is, perhaps, one sign of growth. In the field of 
personality this movement is now well underway. Theories are regarded
as heuristic tools to be judged on criteria of research-generation;
definitions of personality via reference to hypothetical structures
are eschewed. The only definition of personality attempted is
as a field of study, and the definition of that field is steadily assimilating
traditional differential psychology. Explicit concern with the
phenomena of individual differences in ability and performance,
their origin, development and correlates, now characterizes the field,
and this increasing concern with empirical study of individual dif- 
ferences has generated a new level of methodological sophistication. 
(One sign of such trends is the increasing use of descriptive subtitles
for personality texts which characterize the contents as “objective,”
“behavioral,” or “research” approaches. A more meaningful
reflection of this trend can be obtained from comparison of a
1966 personality text with a differential text from the last decade,
such as Tyler’s 1956 edition.)

Sarason’s “objective approach” to personality describes and delineates
these developments in a text that is quite elementary, but
rarely shallow. Personality is considered “as an area of investigation
rather than as an entity, real or hypothetical”; the major task
of personality study is formulated as the search for variables predictive
of differences across individuals and within an individual
across time. Hypothesis-generation rather than hypothesis-testing
is thus the order of the day and this is the criterion by which
theories are to be evaluated. The “Theories of Personality” are reformulated
as “theoretical orientations,” and the task in the introductory
course is to develop the “inextricable tie between methodology
and theory”—the inter-relationship of hunch, experimental fact,
and theoretical hypothesis.

Sarason’s effort to do this requires some 600 pages divided among 
five equal-length sections. The first of these is a survey of major 
theoretical views—the “old” course in theories of personality, but 
now presented in one-fifth the time. Parts II, III, and IV survey 
assessment techniques, experimental research and developmental 
analysis as the fundamental approaches to empirical study of per- 
sonality. A final section, part V, exemplifies these approaches via 
an analysis of deviant behaviors (psychopathology). This arrange- 
ment leads to some rather curious organizations; e.g., morphological, 
genetic and physiological factors are here considered as assessment 
tools. Yet, given Sarason's underlying argument that the proper 
task of personality research is that of discovering the relevant vari- 
ables, this is a meaningful organization that gives unity to his 
writing. 

Though standard in content (analytic/dynamic, learning, self and


and Sarason’s personal in- 
has just edited two volumes 
and enthusiasm. 


ata generated by it. Evalua- 
student is provided with 
useful models of careful criticism. Thus, psychoanalysis qua theory
is rigorously evaluated for its “premature proliferation of interact- 
ing constructs that are ambiguously related to overt behavior,” but 
Sarason also suggests—and in a later chapter provocatively dem- 
onstrates—the immense heuristic value of the central constructs of 
analytic theory. 

Personality assessment is approached as a methodology appli- 
cable to many areas of empirical investigation. In seven tightly- 
written chapters the content ranges from interviews through 
structured and projective testing to situational, historical, and biological
assessment tools. Quite properly, stress is placed on methodological
issues. The logic of trait attribution—the critical distinctions
between observation and inference, between the meaning
of a construct and its evidential base—is especially sound and 
sorely needed at the undergraduate teaching level. A careful review 
of research on the Rorschach leads to a critical evaluation and ul- 
timate rejection of the test in terms of a psychometric model. In- 
stead, it is viewed as a variant of the interview, and recent studies 
of content analysis and social-interaction variables are surveyed.
And a brief but incisive analysis of response sets typifies the up-to- 
date nature of this work. 

Sarason’s emphasis on methodological problems in assessment and 
his central concern with construct validation are surely to be ap- 
plauded. But it is possible that he overdoes it. When he writes, e.g., 
of “construct validity in relation to the interview" it is not clear to 
what he refers. The task of identifying and analyzing those factors 
which determine interview behaviors is important; that some par- 
ticular interview behaviors reflect particular personality constructs 
such that specific testable hypotheses can be examined is probable; 
unfortunately, as Sarason himself argues, psychologists are as yet 
a long way from making these tests, and overworking the term, con- 
struct validity, risks making it a vacuous synonym for scientific in- 
ference. More importantly, the logic of construct validation, though 
well described, is poorly exemplified. Thus, in an otherwise well-
written account of the MMPI Sarason’s single criticism of the in- 
strument is to question the adequacy of the original samples used 
to criterion-key the items: “in a way, the scales of the MMPI can 
be expected to be no better than the diagnoses on which they are 
based.” Since it was evidence such as the empirical enrichment of 
the meaning of MMPI scales that led to the notion of construct 
validity, the student would be better served by detailed analysis 
of the phenomenon of “bootstrapping.” The chain of inference in
validation of assessment techniques is highly complex
and frequently involves theoretical elaboration of the criterion predictions
as much as careful analysis of the test predictors; incremental
evidence both extends the network defining the construct
and adds to the construct validity of the test. In short, current use
of the MMPI can not stand on the basis of its original validation,
but that is precisely what construct validity is all about. 

This reviewer found the seven chapters on experimental person- 
ality research to be the strongest in the book. Sarason makes an ad- 
mirable attempt to wed personality theory and research: while not 
always totally successful, a chapter summarizing research on hu- 
man performance broadly generated by analytic theory is well 
done. Treatment of verbal behavior, the role of awareness in verbal 
conditioning, the experimental induction of attitude change, and the 
implications of such studies for personality assessment and behav- 
ior modification is particularly strong. Noteworthy also is the dis- 
cussion of experimenter bias, of instructional sets, subject and sit- 
uational variables which influence human performance through 
which numerous potential studies appropriately regarded as per- 
sonality research are suggested. 

Six chapters provide a useful, albeit conventional, consideration 
of personality development. Included is an effective analysis of the 
child as stimulus for parental response—the interaction of consti- 
tutional predispositions with early social learning history. Treat- 
ment of biological and constitutional factors in development is brief, 
but honest. Sarason does not do justice to such factors, but he does
acknowledge this and persistently reminds his readers of it; in so doing
he avoids—or at least extends—the myopic vision of most American
psychologists in evaluating developmental determinants.

A final five chapters discuss the classification and modification 
of deviant behaviors and recent studies in assessment, experimental 
research, and developmental analysis of psychopathology. Since the 
relationship of psychopathology to normal personality is, in part, 
an open empirical question, and one on which psychologists have
strong feelings, the burden is on Sarason to justify devoting nearly 
20 per cent of his text to such behaviors. This he does quite well. 

1 not only a logical one, it provides a 
unity that would otherwise be lacking. Борай, is on алок 
study of psychopathology, and the strength of these chapters lies in 

advances in behavior modification and so- 
ement programs in treatment of deviant 


RICHARD J. ROSE
University of Illinois 
Sensitivity to People by Henry Clay Smith. New York: McGraw-
Hill Book Company, 1966. Pp. x + 226. Text Edition. 

Sensitivity to people and sensitivity training are discussed and 
explained by most writers in very eclectic and nebulous terms.
Evaluation of the results of training has been mainly through testi- 
monials and impressionistic judgment. In some cases the practice of 
sensitivity training begins to resemble mystical rites and incanta- 
tions. Smith, on the other hand, proposes a theoretical framework 
based on specific, precise research results. He proposes a new way of 
looking at existing data and of reanalyzing the results already so 
abundant in the literature and in the files of conscientious re- 
searchers. 

Sensitivity to people, which is a fundamental capability, should be
of concern to everyone. By definition, “Sensitivity is the ability to
predict what an individual will feel, say, and do about you, himself,
and others” (p. 3). Sensitivity to people is a measurable performance
and, as such, sensitivity training need not be evaluated via
the testimonial or impression but through operationally defined cri- 
teria. The criteria which Smith employs are the components of judg- 
ment accuracy conceived by Cronbach (1955). These criteria are: 
level accuracy, spread accuracy, empathic accuracy, observational 
accuracy, stereotype accuracy, and individual accuracy. Smith uses 
this framework for the organization of the book, devoting a chapter 
to each of the components. In addition, there are five chapters dealing
with education and with the implications and applications of
sensitivity. 

The idea of measurement is interwoven throughout the structure 
of the theory of sensitivity proposed by Smith. The model in- 
corporating the six components of sensitivity focuses on the per- 
ceiver, the person being judged, and the interaction between the 
perceiver and the person. Level and spread accuracy are the components
of the perceiver's judgments; stereotype and individual accuracy
are determinants attributable to the person being judged;
and empathy and observation accuracy are the determinants stem- 
ming from the interaction. In this scheme, Smith accounts for the 
differences in sensitivity. The sensitive person is one who is able to 
relate his judgments to an absolute or appropriate average and 
spread. He can put himself in the other person's position and is conscious
of, and receptive to, the responses and performance in the
interaction. Finally, the sensitive person has valid and relevant perceptions
of groups and the capability to differentiate members
within that group. Lack of sensitivity on the part of an individual 
can arise from errors or lack of accuracy in any one of the com- 
ponents. Identification of the component is tantamount then to the 
identification of training needs, and the criterion for the evaluation 
of training. While knowing what kind of error a person makes does
not immediately explain why the error occurs, the diagnosis of train-
ing needs is narrowed considerably. 

Certainly this is an elegant model in that it proposes to account 
for the variance in sensitivity and incorporates the fundamental 
measurement to transform an art into a science. Will the model find 
its way into practice? Probably not, since an art with the adorn- 
ment of mysticism and testimonial accolades sells much faster and 
at a higher price than does science. 

Perhaps the most valuable message this book has to offer is that 
regarding Sensitivity for Education (Chapter Eleven). Anyone who 
is genuinely interested in higher education would read this chapter 
even if sensitivity training is of negligible salience. For those to 
whom sensitivity is a salient issue, the book is extremely well writ- 
ten, comprehensible, and a must for their professional reading and 
self-development. 
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Small Group Research: A Synthesis and Critique of the Field by 
Joseph E. McGrath and Irwin Altman. New York: Holt, Rine- 
hart and Winston, 1966. Pp. ix + 601. $12.50. 

After nearly a decade at the task, McGrath and Altman have produced
a synthesis and critique of small group (group dynamics)
[...]ntive classification system the relationships which emerge
in a sample of 250 studies. Third, 400 pages of bibliographic
material with annotations supplied for the sample of 250 studies
just mentioned; although the results of these annotated studies
are used in the first and second [...]
pages of comment on the small group [...]odology, culture, and
current level of knowledge.
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studied. This system was validated in Phase 2 by determining its 
consistency with empirical results from past studies. This validation 
led to Phase 3, which involved a modification of the system. Assess- 
ment of the merit of the McGrath-Altman system requires some 
detail of each phase. 

Phase 1, definition of the system, was based on the assumption 
that verbal labels for variables (e.g., cohesiveness, conformity) are 
inappropriate for classification purposes, since different labels in 
different studies could pertain to a single operationally-common 
variable, and since the same label in different studies could pertain 
to several operationally-different variables. Rather, each agent
(independent, predictor) variable, and each resultant (dependent,
predicted) variable, was indexed by a data item classified according to
six logically-defined parameters (the heart of the system). (1) Object.
Does data item pertain to a characteristic of a Member, Group,
or Surround? (2) Mode. Does data item pertain to State (static) or
Action (dynamic) characteristic of the object? (3) Task. Is data
item Descriptive (amount of the characteristic) or Evaluative
(discrepancy of characteristic from a standard or ideal)? (4) Relativeness.
Is data item Relative (characteristic judged relative to other
objects) or Irrelative (absolute judgement)? (5) Source. Is data
item obtained from a Member, Group (including its representative),
or External (investigator, investigator surrogate, instrument)? (6)
Viewpoint. Is data item Subjective (from viewpoint of source),
Projective (from viewpoint other than source), or Objective (from
impersonal viewpoint)? McGrath and Altman refer to these six
parameters as fundamental operational properties of data items.

Phase 2, validation of the system, was based on the extent to 
which predictions from the system were borne out by the results of 
a sample of 250 empirical studies. The authors predicted that the 
degree of relation between an agent variable and a resultant vari- 
able would be proportional to the number (0-6) of operational 
properties shared by the two variables. In the test of this prediction, 
the authors prepared a bibliography from which they drew a sample 
of 250 studies, a sample which yielded nearly 10,000 relations between
agent and resultant variables. The authors then determined
the percentage of significant (p < .05) relations between agent and
resultant variables which shared the same number (0-6) of operational
characteristics. For instance, significance was achieved by
about 30 per cent of the relations between agent and resultant variables
which shared no operational properties. A similar percentage
was obtained when the number of shared properties ranged from
one to four, with five shared properties yielding 40 per cent, and six
yielding about 50 per cent. The existence of only three percentages
suggested to the authors that only three operational properties were 
contributing to a prediction of the empirical results. 
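
The tabulation described in Phase 2 can be sketched in a few lines of Python. This is only an illustrative reading of the review's description, not the authors' procedure: the dictionary keys, the function names, and the two example data items are hypothetical, and no figures from the book are reproduced.

```python
from collections import defaultdict

# The six classification parameters named in the review (keys are hypothetical labels).
PARAMETERS = ("object", "mode", "task", "relativeness", "source", "viewpoint")


def shared_properties(agent, resultant):
    """Count how many of the six parameters two data items share (0-6)."""
    return sum(agent[p] == resultant[p] for p in PARAMETERS)


def percent_significant_by_shared(relations):
    """Tabulate the percentage of significant relations per number of shared properties.

    `relations` is an iterable of (agent, resultant, significant) triples,
    where `significant` is True when the reported relation reached p < .05.
    """
    totals = defaultdict(int)
    hits = defaultdict(int)
    for agent, resultant, significant in relations:
        k = shared_properties(agent, resultant)
        totals[k] += 1
        hits[k] += int(significant)
    return {k: 100.0 * hits[k] / totals[k] for k in sorted(totals)}


# Two made-up data items, for illustration only.
agent = dict(object="Member", mode="State", task="Descriptive",
             relativeness="Irrelative", source="Member", viewpoint="Subjective")
resultant = dict(object="Group", mode="Action", task="Descriptive",
                 relativeness="Irrelative", source="External", viewpoint="Objective")

print(shared_properties(agent, resultant))  # these two items share Task and Relativeness
```

Grouping the percentages by the number of shared properties in this way is what produces the three levels (about 30, 40, and 50 per cent) that the review reports.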
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Phase 3, modification of the system, was based on an investiga- 
tion of the empirical data with the intent of determining which of 
the six fundamental operational properties should be retained in the 
system. After an intuitively-guided investigation, McGrath and 
Altman decided to retain Object, Mode, and Source-Viewpoint con- 
sidered as a joint characteristic. 

This final operational classification system is so simple that it 
suggests a contribution which is either profound or banal. The eval- 
uation which follows is based only on an appraisal of the system, 
and, more specifically, only on the methodology used in the develop- 
ment of the system and on its conceptual basis. 

Regarding methodology, the selection of the 250 studies appears 
inappropriate for the critical use made of these studies in the validation
of the system. To be appropriate for validation purposes,
these 250 studies should have been representative of small group 
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Using this sample of 250 studies, coders classified some 20,000 
data items according to the initial 6-parameter system. In view of 
the existence of standard quantitative methods for reporting coding 
reliability, the serious reader may wonder why the authors chose 
only to report that “reliability of coding of most types of informa- 
tion was adequate." 

While the development of the system involved a number of such 
methodological limitations, the system is even weaker with respect 
to its conceptual basis. Only two points will be discussed here. 

First, the rationale of the system is not made clear. The authors 
simply state that the rationale is based on the assumption "that 
data items can be described in terms of a set of six fundamental 
properties or parameters." But the sense in which these properties 
are considered fundamental is nowhere spelled out. Do the authors 
consider these parameters to be necessary and/or sufficient, and if 
so, on what basis? Out of an infinite number of possible sets the 
authors chose a particular one. Why that one? On what basis do the
authors believe that an operational classification is appropriate for 
psychological data? Is understanding better achieved when stable 
results are operation-dependent or operation-independent? These
questions are not considered in the book.

Second, the hypothesis that the McGrath-Altman operational 
characteristics are fundamental is tested (validated) with a crite- 
rion which also supports alternative, competitive hypotheses. Recall 
that the authors interpret as support for the fundamental nature of 
their operational characteristics the increasing percentage of sig- 
nificant (p < .05) relations between agent and resultant variables 
associated with an increasing number (0-6) of operational char- 
acteristics shared by agent and resultant variables. However, as 
agent and resultant variables become increasingly similar with 
respect to the Object being measured and the Source which supplies
the measurement data, the two variables are more likely to be sim- 
ilar with respect to the conditions under which the measures are 
obtained, such as point in time, instruments used, temperature, and 
the like, characteristics which would be expected to result in a
higher proportion of significant relations. Further, the range in the 
criterion measure (percentage of significant relations) achieved on
the basis of the McGrath-Altman characteristics can also be
achieved on the basis of the number of reported relations per study.
As this number increases, the percentage of significant relations decreases
[...] reporting 100 or more relations, only 21
[...] which is lower than the [...]
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considerations, the possibility exists that the McGrath-Altman char- 
acteristics may be artifactual, rather than fundamental. Sex of 
author and type of publication (journal article vs. report) also sep- 
arate studies into categories having percentages of significant rela- 
tions similar to those which emerge on the basis of the McGrath- 
Altman characteristics. 

There is no question but that Small Group Research represents 
an extensive undertaking. The sections not reviewed here—the re- 
lationships between substantively-defined variables, and the de- 
scription and critique of the field—will undoubtedly be read with 
interest by the small group researcher. The bibliography through 
1962 should prove useful, especially if supplemented with bibli- 
ographies previously supplied by Hare, Borgatta, and Bales, and 
by Raven. 

Small Group Research is a synthesis and critique of a field which 
the authors fail to define. True, the field involves social psychologi- 
cal research with small groups, but apparently not all small groups, 
if the 250 sample studies may be taken as a guide: not one of these 
deals with the most important of all small groups—the family. 

EDWARD LEVONIAN
University of California 
Los Angeles 


Mental Health and Achievement: Increasing Potential and De- 
creasing School Dropout by E. P. Torrance and R. D. Strom. 
New York: John Wiley and Sons, Inc. Pp. + 417. 

This paperback may be best described as “constructively non- 
conforming.” It is neither a co-authored text nor a book of readings. 
It is a refreshing compilation of published articles, re-written previously
published articles, original and re-writ[...]
and speeches, and never before published invited papers.

The authors, in the preface, sta[...]
i.e., ". . . to assist school personnel [...]
in the improvement of their roles [...]
tomorrow's adult." To accomplish this objective th[...]

Typical of most multi-authored works, and particularly those
[...] confronted with the problem [...]
papers "hang together" in a unified, co[...]
[...] the book has been divided
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ing, and Evaluation. Further cohesiveness is afforded the reader by 
the introductions to the sections, of which the first two were written 
by the junior author and reflect his work with Project Dropout 
for the National Education Association. 

The authors have not delineated an underlying theoretical model 
either explicitly or implicitly. There is a theme of “mental health” 
throughout all of the chapters which does provide a unifying con- 
ceptual thread. Unfortunately, the term is only defined once in the 
book (Farnsworth, p. 184) and in its positive context states that men- 
tal health is “freedom with responsibility, flexibility, self-reliance, 
and a genuine concern for the public welfare.” The majority of the 
papers presented reflect this broad umbrella-style definitive model. 

Many psychologists would agree with Confucius, who, in the
Fifth Century, B.C. was purportedly asked by one of his followers, 
“Master, what is needed to change the world?” The sage is said to 
have replied, “A proper definition of things.” For those concerned 
with behavioral, operational, and consequently measurable defini- 
tive models for terms such as “mental health,” “creativity,” and 
“disadvantaged” there will be disappointment. If the reader enjoys 
the challenge of differential perceptions of these abstractions, he will 
find the reading thought-provoking.

If one wanted to utilize a content analysis method on the book,
one would find ample space devoted to the expected topics of mental 
health, achievement and the achievement motive, creativity, and 
school dropouts. In addition, however, numerous foci are afforded 
the role of the teacher, educationally disadvantaged youth, self- 
concept, adolescence, peer groups, slow learners (though no mention 
is made of Kephart’s work), testing as related to mental health, dif- 
ferential types of school programs, and the learning process. 

Because the book previews new directions in research related to 
the educationally disadvantaged (Section I), highlights innovative 
programs in instructional methods and materials and prognosticates 
future trends in school programs (Section II), and brings to the
reader fresh insights into human dynamics (Section III), it could
well be required reading for the educational sophisticate. For the
pre-service teacher the competent re-statement of basic guidance
principles in context of the maximum development of all human
resources would qualify this work as a “must” for the neophyte.
Newton S. METFESSEL
University of Southern California
Mildred MURRY
Los Angeles City Schools


Role Theory: Concepts and Research by Bruce J. Biddle and
Edwin J. Thomas. New York: John Wiley & Sons, [...]
+ 453.
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The growing concern and interest in role theory is illustrated by 
an increase in published articles from fourteen in 1932 to three 
hundred and forty-three in 1962. Since role theory is a new field of 
inquiry, the authors have attempted to identify, articulate, and 
analyze the component aspects of role theory as a step in crystalliz- 
ing role theory as a recognized specialization in the behavorial sci- 
ences, 

The authors delineate the domain of role theory as that which is 
concerned with the entire person-behavior matrix, which has chosen 
for its domain of study real-life behavior as it is displayed in gen- 
uine on-going social situations. It examines the processes and phases 
of socialization, interdependences among individuals, the characteristics
and organization of social positions, processes of conformity
and sanctioning, specialization of performance, and the division of 
labor, and many others. 

The book consists of two parts: the first a relatively short section 
of essays, the remainder a selection of forty-seven readings. The 
essays serve as a text covering the “Nature and History of Role 
Theory,” “Basic Concepts for Classifying the Phenomena of Role,” 
“... The Properties of Role,” and “... The Variables of Role.” The 
authors state “. . . in the essays prepared for this book, an effort was
made to articulate the domain of role theory through an analysis 
of the component activities of the field, their history, and current 
status; and to present the basic role concepts along with some fea- 
tures of their underlying structure.” 


Patricia T. BOTKIN
University of California, 
Santa Barbara 
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Psychological Measurement and Prediction by Paul Horst. Belmont, 
California: Wadsworth Publishing Company, Inc., 1966. Pp. 
xii + 455. $8.95. 


This book by Paul Horst, based on thirty-five important years of 
activity in virtually all facets of measurement and prediction, 
clearly reflects his values, beliefs, and interests; and in this sense 
is a personal book rather than an eclectic text. I found it important,
stimulating, valuable, carefully prepared, up-to-date, episodic, and 
frustrating at times. 

Although it has many sections that a bright college freshman 
would find quite understandable, the book definitely is not an intro- 
ductory text, but is more a successor to Gulliksen’s Theory of Mental 
Tests (1950). For the reader, some background of statistics is essen- 
tial, a little knowledge of tests and measurement would be helpful, 
and some familiarity with matrix algebra would facilitate the under- 
standing of most of the book and, at the least, permit the reading 
of all of it. Providing a simple setting for the later material, the 
first four of the twenty-seven chapters are a verbal and nontechnical 
survey of topics in measurement in psychology, test domains, types
of tests, and a number of aspects of the conditions of test adminis- 
tration. Subsequent chapters become increasingly mathematical. 

For the student the chapters on covariation, multiple measures, 
item and test statistics, item analysis, and classical measurement 
theory will be particularly valuable and clear. The sections оп. the 
effects of chance on test statistics, multiple prediction, and reliability 
are quite good, particularly, for the reader able to put them in the 
context of other presentations. The fifteen pages on factor analysis, 
which are almost completely verbal, are equal to any similar 
introductory chapter on the topic. The material on test homogeneity 
and the effects of sample selection are particularly stimulating, but 
[...] sophistication is probably required to appreciate them [...]

As a book on psychological measurement and statistics the coverage
is [...] complete, although, as Horst indicates, psychophysics,
scaling theory, mathematical models, and the recent
de[...] in [...] are not treated. The more technical
chapters are each followed by a section on mathematical proofs,
resulting in the opportunity for the student to follow the line of
reasoning in a chapter without having, at first reading, to work
through a complicated and difficult set of equations. This is a useful
mode of presentation. Another unusual feature of the text is the
presentation of many of the mathematical statements in both scalar
and matrix form. The use of matrix notation may stun the [...]
reader, but any serious student might as well face up early to the 
ubiquitous and useful role of matrix algebra in measurement. 
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Reflected throughout the book is Horst’s longstanding interest in 
pragmatic aspects of measurement and prediction; for example, 
Frederic Lord's recent work is not included because Horst cannot 
see how it could be incorporated, in its present state, into an essen- 
tially practical approach largely concerned with the problems of 
prediction. Applied interests, however, have not led to a superficial 
book. The many mathematical proofs and technical and theoretical 
discussions will satisfy the advanced student. In fact, the teacher 
considering the book as a text for a course would probably prefer 
more examples and hope for some problems and answers. Clearly, 
this book concentrates on applied problems in a special manner, not 
in a cookbook fashion, but by treating theory underlying practical 
measurement and prediction issues. Horst’s belief in the importance 
of the binary item is evident, and the reviewer is very sympathetic 
to his emphasis that test scores are based on items scores and thus 
test statistics are based on item statistics. This is one of the best 
aspects of the book. 

Most of the chapters are based on lecture notes that Horst has 
developed over the years. The individual chapters are well organized
and coherent, but the connections between the chapters are not
always clear; because of this the reviewer found the book somewhat
episodic. However, Horst cannot be faulted for the lack of any
acceptable, coherent theory of measurement, especially in the light
of his many important contributions to the field. 

Every book, of course, is dependent on the choices of the author 
of what to cover and of what viewpoint to use. Sometimes an author 
chooses to write an eclectic text. This book, however, as mentioned 
above, is a personal one. This is one of its many strengths; but it, of 
course, provides some frustrations, since few readers will agree com- 
pletely with Horst’s approach and choice of topics. Perhaps some 
day someone will write the Great Measurement Theory Book. Paul
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T. R. HUSEK
University of California,
Los Angeles 


Test Theory by David Magnusson. (Translated by H[...])
Reading, Massachusetts: Addison-Wesley, 1966. Pp. [...]
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'This compact little book on test theory is a translation of the 
second edition of the author's Testteori, originally published in 
Swedish. Although there are occasionally abrupt changes of topic 
in mid-paragraph throughout the book, the translator has been 
effective in making the book appear as if it were originally written 
in English. 

A fairly elementary overview of classical test theory is given, 
but the student will need a statistics course or two to understand
most of it. The first four chapters are basic material on scales of 
measurement, variance, covariance, and correlation. Chapters 5 
through 9 are concerned with the problem of test reliability, and 
include useful information on the standard error of measurement, 
the reliability of differences, and the error variance. Chapters 10 
through 12 discuss validity, prediction, and selection. Some introductory
material on factor analysis is presented in Chapter 13. The
last three chapters in the book, Chapters 14 through 16, deal with 
the topics of item analysis, guessing, scales, transformations, and 
norms. The author has also included a number of selected exercises
for each chapter, with answers at the end of the book, together with 
an appendix on the normal probability integral. 

Admittedly, the mathematically unsophisticated student will need 
assistance on certain chapters in this book, e.g. Chapters 9 and 13, 
but the reviewer has yet to encounter a more lucid presentation of 
classical test theory. Consequently, the book is heartily recommended
for upper undergraduate and beginning graduate courses in
tests and measurements.

Lewis R. AIKEN, JR.
Guilford College
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SYSTEMATIC CONSTRUCTION OF DISTRACTORS
FOR ABILITY AND ACHIEVEMENT TEST ITEMS* 


LOUIS GUTTMAN AND I. M. SCHLESINGER


The Israel Institute of Applied Social Research and 
The Hebrew University, Jerusalem 


DISTRACTORS, according to the definition given by English and
English (1958), “are designed to be attractive . . . to the respondent
who does not know the correct answer.” Keeping the correct
answer company is usually regarded as the only function of distractors,
and sufficient attraction is deemed sufficient qualification
for being a good distractor. Hence, distractors are usually con- 
structed initially on the basis of intuition as to what answer might 
be attractive, or else by gleaning answers from tests which are 
first presented in open-ended form. By contrast, it will be argued 
here that distractors can be constructed in a systematic fashion 
on an a priori basis, which provides at least three desirable fea- 
tures not possessed by less systematic approaches: 


1. Successful prediction of relative empirical difficulties of dis- 
tractors, 

2. Reduction of variation in test results due to undesired factors, 

3. Possibility of differential scorings of subjects on the types of 
wrong answers to which they are attracted. 


These features may enable construction of tests of shorter length 
than usual for desired reliability and validity, and increase the 
possibilities of using a test for diagnostic and remedial purposes 
with respect to habitual errors of individual subjects. 


1Some of the research reported in this study was supported by the U. S.
Department of Health, Education and Welfare under CRP No. OE-4-21-014
and No. OE-5-21-006.
Thanks are due to Miss Marlyn Grossman who helped with the analysis of
part of the data reported herein.
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“Noise” Generated by Unsystematic Construction of Distractors 


The test items with which we are concerned are of the “closed” 
form. Hence, an item is defined by two parts: the question, and 
the possible answers or distractors (cf. Guttman, 1965b, pp. 167- 
8). 

Distractors are an integral part of the stimulus situation and
determine the degree of difficulty of an item in conjunction with 
the question part. The question part of an item will not correspond 
to any definite level of difficulty when construction of distractors 
is haphazard.

When it is intended to tap different aspects of ability or different 
sub-areas in an area of achievement, the analysis of results may 
aim at revealing the resulting patterns of interrelations. Such anal- 
yses have been made, for example, by Gabriel (1954) for Raven's 
Progressive Matrices, and by Guttman (1957, 1965a,b) for Thur- 
stone’s battery of intelligence tests. These are actually reanalyses 
of results of tests constructed by previous investigators, in which 
distractors were not systematically constructed, and hence may 
have introduced a “noise” factor which depresses the obtained 
correlation coefficients to an unknown degree. In principle, unsys- 
tematic choice of distractors may even affect the correlational con- 
figurations of items or subtests, and not merely the overall size of 
the coefficients.

The investigation of interrelationships among abilities may be 
furthered by a more systematic approach to the construction of 
the content of the question parts of test items, as well as of the 
distractor parts. Questions are typically constructed by a kind of 
trial-and-error procedure where the intuition of the investigator is 
subsequently checked by some form of item analysis. The fallacy
of using such statistical techniques as a criterion for content of
questions (and distractors) has been pointed out long ago (Gutt[...]
[...] and leads to a construction of
items on the basis of an a priori definition of test content. Sys-
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tematic construction of distractors for the same questions is facil- 
itated by the same facet designs; an example is given in the next 
section. Such a formalization is especially important for the study 
of interrelationships between items or sub-tests which has been 
mentioned above. Further, this approach is also important in the 
construction of parallel forms of a test. Facet design leads to a 
more complete definition than usual of the universe of items for 
which only a sample is constructed for a given test. Unless such a 
design is employed, there is no way of predicting beforehand 
whether items included in two parallel forms should be of com- 
parable difficulty, even granted they belong to the same universe. 
The same applies also to distractors of the items. 


The Use of Information from Systematically 
Constructed Distractors 


In the following, two examples will be given showing how dis- 
tractors may yield useful information which cannot be obtained 
merely by dichotomizing answers into those which are correct and 
those which are incorrect, nor even by ranking from “most” to 
“least” correct. The examples will also serve to illustrate how distractors
may be constructed through facet design. The first example
is on the construction of different degrees of difficulty, and the 
second is on the construction of systematic kinds of errors. 


Example 1: Distractors Varying in Degree of Attraction 


Distractors usually vary in the degree of their attraction for
the respondent, namely the proportion of respondents who choose
a given distractor when they do not choose the correct answer. It may be
hypothesized that the degree of attraction of a distractor increases
monotonely with its “degree of similarity” to the correct answer. A
viable a priori definition of "degree of similarity" is one that is based
only on considerations of content, yet successfully predicts empirical
attractiveness. Such a definition is provided by a particular
facet design, as will now be illustrated.

Consider an item of a test employing figures varying on three
facets: e.g., shape, size, and orientation. Table 1 gives a design
for a set of eight distractors, including the correct answer, for
such an item.
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TABLE 1 


Eight Distractors of an Intelligence Test Item in Which Figures 
Vary in Three Facets, A, B, and C 


Distractor    Facet A    Facet B    Facet C
    1           a1         b1         c1
    2           a1         b1         c2
    3           a1         b2         c1
    4           a1         b2         c2
    5           a2         b1         c1
    6           a2         b1         c2
    7           a2         b2         c1
    8           a2         b2         c2

Let any one of the distractors represent the correct answer. Then 
there are three other distractors which differ from this correct 
answer on one (out of the three) facets, three distractors which 
differ from the correct one on two (out of the three) facets, and 
one distractor which differs on all three facets. Suppose distractor 
1 of the table is the correct answer. Distractor 2 differs from it in 
Facet C only, distractor 3 in Facet B, and distractor 5 in Facet A. 
Three other distractors (4, 6, and 7) differ from it on some two 
facets each, and distractor 8 is wrong on all three facets. The 
structure of a priori dissimilarity of the distractors from the cor- 
rect answer, within the definitional system of the three facets, is 
expressible as a partial order as shown in Figure 1. 


Figure 1. The structure of a priori dissimilarities of distractors to
a [...] the definitional system of the dichotomous Facets A, B, and
C. (Distractor [...] appears twice for convenience of presentation.)
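
The partial order described above can be checked mechanically. The following Python sketch assumes the coding of Table 1 (two elements per facet, written 1 and 2) and takes distractor 1 as the correct answer; the function names are illustrative and not part of the original study.

```python
from itertools import product

# Three dichotomous facets (A, B, C); elements are coded 1 and 2 as in Table 1.
FACETS = ("A", "B", "C")
profiles = list(product((1, 2), repeat=len(FACETS)))  # eight profiles, in Table 1 order


def facets_wrong(profile, correct):
    """Return the set of facet names on which `profile` differs from `correct`."""
    return {f for f, p, c in zip(FACETS, profile, correct) if p != c}


def at_least_as_dissimilar(x, y, correct):
    """Partial order: x is at least as dissimilar as y if x is wrong wherever y is wrong."""
    return facets_wrong(y, correct) <= facets_wrong(x, correct)


correct = (1, 1, 1)  # distractor 1 taken as the correct answer
by_count = {}
for number, profile in enumerate(profiles, start=1):
    by_count.setdefault(len(facets_wrong(profile, correct)), []).append(number)

print(by_count)  # {0: [1], 1: [2, 3, 5], 2: [4, 6, 7], 3: [8]}
```

Distractors 2 and 3 are each wrong on a single but different facet, so neither is "at least as dissimilar" as the other; this is why the structure is a partial order rather than a simple ranking.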
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In a recent study (Guttman and Schlesinger, 1966), similarity 
of distractors was manipulated as described here. Each test item 
was essentially a three by three matrix of small squares as shown 
in Figure 2. 




Figure 2. An example of analytic ability test items. 


In eight of these squares a figure appeared, and the missing 
figure of the ninth square had to be supplied by the subject. 
Figures varied in the following facets: (a) shape, (b) size, (c)
orientation, and (d) place. Different subsets of two or three of 
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these facets were employed in different items. Thus, in a certain 
item figures might vary in respect to shape, orientation, and place, 
while size remained constant. The correct answer might then be 
defined as, e.g., shape_i, orientation_j, place_k. In another item, only
size and place might be varied, and the correct answer might be
size_i, place_j. Since each given profile a_i b_j c_k (or a_i b_j) generally has
different interpretations for different items, only a crude average 
analysis was carried out, taking into account only the number of 
facets in which a distractor differed from the correct answer. The 
partial order is meaningful among the distractors within an item, 
and not necessarily between items. It was assumed that an aspect 
of the partial orders which would be approximately comparable 
across items would be the number of facets on which the profiles 
differ. In theory, this need not be strictly true. The degree of 
attraction of distractors is measured by the mean number of re- 
spondents choosing the distractor. As shown in Table 2, defini- 
tional similarity predicts degree of attraction, and this result is 
replicated in six subtests of an analytical ability test. Three of 
these subtests (I-III) comprise items in which two facets are
varied; and another three (IV-VI) comprise items in which three
facets are varied.


TABLE 2

Average Number of Subjects Choosing a Distractor
Incorrect on One, Two, or Three Facets

                 Subtests with Items Varying         Subtests with Items Varying
                        in 2 Facets                         in 3 Facets
No. of Facets        I          II         III          IV          V          VI
Incorrect        (N = 487)  (N = 469)  (N = 332)    (N = 323)  (N = 332)  (N = 321)

one             [illegible]    14.2       17.5         18.1       22.5   [illegible]
two                6[?]         5.2        5.7         12.0       1.[?]      9.2
three                    (not applicable)               1.8        3.3       6.3


These results show that distractors can be constructed so that
their order of difficulty is [...] two scores according to the degree
of similarity of the distractor chosen (in the above example: 0, 1,
2 or 3 attributes incorrect). Thus, the usefulness of even short
tests can be increased through use of such differential scores.


Frequencies of Facet Elements 


Note that the above design has the property that the two values 
of each facet appear with equal frequency in the set of eight dis- 
tractors. Hence, the respondent can get no clue from the relative 
frequency of these values as to which one is correct. A contrasting 
example is Raven’s Progressive Matrices, where in some of the 
items certain aspects of the correct answer appear in more than 
half of the distractors. This seems undesirable, since it is difficult 
to tell to what extent respondents are influenced, with or without 
awareness, by such clues and whether these clues operate as a
help or a hindrance. 

Whether or not a set of distractors can be made to have an 
equal number of distractors for each of the values, depends on the 
relationship between the number of distractors and the number 
of facets. Consider a set in which only two facets are varied. The
investigator has the following choices: 

1. Introduce a third facet and employ a design similar to that 
of Table 1. This may have the possible disadvantage that four 
of the distractors—namely those exhibiting incorrect values in the 
additional facet that does not vary elsewhere in the item—will 
be so unlikely a choice as to attract virtually no respondents at 
all. 

2. Vary only two facets, but employ a different number of dis- 
tractors (e.g., distractors 1-4 in Table 1 when only Facets B and 
C are varied; alternatively three values may be employed for nine 
distractors). This solution seems to be preferable to (1) above. 
However, when the test includes items differing in the number of 
facets varied, it may introduce an additional factor of item diffi- 
culty through varying the number of alternative answers presented. 

3. Finally, of course, the objective of having each value equally 
represented may be abandoned. For example, items varying in two 
facets may have eight distractors, by employing three values
appearing with unequal frequencies, as follows: $a_1b_1$, $a_2b_1$,
$a_1b_2$, $a_2b_2$, $a_3b_1$, $a_3b_2$, $a_1b_3$, $a_2b_3$. The
requirement of not introducing unaccounted-for variation can be
satisfied by employing this, or any other, design consistently for
all test items varying in two facets.


Example 2: Distractors Representing Different Types of Error


Besides varying in degree of attractiveness, the distractors of 
Table 1 fall into various types by virtue of their content: those 
that are wrong in Facet A (5, 6, 7, 8), those wrong in Facet B 
(3, 4, 7 and 8), and those wrong in Facet C (2, 4, 6 and 8). 
Respondents can, therefore, be classified in respect to errors pre- 
dominantly made by them, e.g., whether they are more liable to 
make mistakes on size, orientation or shape (if these are the facets 
varied). In an analytical intelligence test, perhaps no evident ad- 
vantage accrues from such a classification. However, in achieve- 
ment tests, classification according to type of error may be 
utilized for diagnostic purposes.

A priori classification of types of errors can also be made with- 
out a multifacet design, namely by using a single facet. While this 
may not predict relative attractiveness, it does enable differential 
scoring of the contents of the attraction. The following example is 
taken from a study in progress:

An arithmetic test for eighth graders included 16 questions per- 
taining to percentages, the following being a typical item: 

475 kgs. of sugar were delivered to a grocery store; 48 per cent 

of the sugar was sold on the first day. How many kgs. of sugar 

were sold on that day? 

(1) 475 

(2) 218 

(3) 989 

(4) 228 

(5) other 


[...] of distractors in this test employed the following types:

(a) Application of wrong formula

(b) Copying a number appearing in the question

(c) A number close to the correct answer

(d) "Other"

(Some items of the test actually included a "miscellaneous" category;
but ideally all distractors would conform to the same pattern.)
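
As a rough illustration of such an a priori typing, the sketch below sorts the alternatives of the sugar item into the types listed above. The rules are assumptions made for this example only: the source names the types but does not say which alternative belongs to which type, and the function name `classify_alternative` and the 10 percent closeness tolerance are hypothetical.

```python
def classify_alternative(value, base, percent, tolerance=0.10):
    """Assign an answer alternative to an a priori error type.

    Hypothetical rules (not taken from the source):
    - 'correct'         : equals base * percent / 100, rounded
    - 'copied number'   : equals a number printed in the question
    - 'wrong formula'   : equals base / (percent / 100), i.e. dividing
                          instead of multiplying (truncated or rounded)
    - 'close to answer' : within `tolerance` (a proportion) of the answer
    - 'other'           : anything else
    """
    correct = round(base * percent / 100)
    divided = base / (percent / 100)
    if value == correct:
        return "correct"
    if value in (base, percent):
        return "copied number"
    if value in (int(divided), round(divided)):
        return "wrong formula"
    if abs(value - correct) <= tolerance * correct:
        return "close to answer"
    return "other"


# The four numeric alternatives of the item quoted above.
alternatives = [475, 218, 989, 228]
types = {v: classify_alternative(v, base=475, percent=48) for v in alternatives}
```

Under these assumed rules, 228 is the correct answer, 475 repeats a number from the question, 989 matches division by 0.48, and 218 falls close to the correct answer.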


Empirical Smallest Space Analysis (G-L SSA-II) 


The value of this a priori classification can be assessed by an 
analysis of test results. If there exists a tendency for respondents 
choosing one kind of distractor in one item to choose the same 
type in other items, then it can be said that this type is psycho- 
logically relevant. An analysis of test results of about 2000 eighth 
graders was carried out using the Guttman-Lingoes SSA-II non- 
metric computer program. This is an asymmetric analysis render- 
ing the smallest possible Euclidean space which preserves ordinal 
information about distance (for further information see Lingoes, 
1965; Guttman, 1966; Laumann and Guttman, 1966). An information
function, $P_{a_i \mid b_j}$, about distances was employed in this
analysis, defined as follows:

Let $N_{a_i b_j}$ = number of respondents who chose distractor a on
    item i and distractor b on item j
    $N_{a_i j}$ = number of respondents who chose distractor a on
    item i and who answered incorrectly on item j
then: $P_{a_i \mid b_j} = N_{a_i b_j} / N_{a_i j}$
This is asymmetric in a and b.

An example of values of $P_{a_i \mid b_j}$ for three items is given in Table 3.
Note that the columns in each submatrix add up to 100 percent and 
that the matrix is asymmetric: for any two items the submatrix 
above the diagonal generally differs from that below the diagonal. 
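
The ratio defined above can be sketched in a few lines. This is a minimal illustration, not the procedure actually used in the study: the data layout (one dict per respondent), the label `"correct"`, and the function name `conditional_percentages` are all assumptions. For each distractor a on item i, it returns the percentage of respondents choosing each distractor b on item j, among those who chose a on item i and answered item j incorrectly.

```python
from collections import Counter


def conditional_percentages(responses, item_i, item_j, correct="correct"):
    """Return {a: {b: percentage}} for distractors a (item_i), b (item_j).

    `responses` is a list of dicts mapping item name -> chosen alternative.
    Respondents answering either item correctly are skipped, so each inner
    dict sums to 100, like a column of a submatrix in Table 3.
    """
    joint = Counter()     # count for (a on item_i, b on item_j)
    marginal = Counter()  # count for a on item_i with any wrong answer on item_j
    for r in responses:
        a, b = r[item_i], r[item_j]
        if a == correct or b == correct:
            continue
        joint[(a, b)] += 1
        marginal[a] += 1
    table = {}
    for (a, b), n in joint.items():
        table.setdefault(a, {})[b] = 100.0 * n / marginal[a]
    return table


# Tiny made-up data set, for illustration only.
responses = [
    {"i": "a", "j": "a"}, {"i": "a", "j": "b"},
    {"i": "a", "j": "b"}, {"i": "a", "j": "correct"},
    {"i": "b", "j": "d"}, {"i": "correct", "j": "a"},
]
table = conditional_percentages(responses, "i", "j")
```

Swapping the roles of the two items generally gives a different set of percentages, which is the asymmetry noted in the text.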

The present analysis was carried out by rows, i.e., by likelihoods.
A comparison by columns would have meant comparing condi- 
tional probabilities. Likelihoods are usually less dependent than 
conditional probabilities on the marginal (or unconditional) dis- 
tributions of the items, and may often be expected to give a smaller 
space. For the present data, similar two-dimensional spaces were 
derived from the two kinds of specifications, the likelihoods giving 
a somewhat more interpretable picture. 

Figure 3 shows the first two dimensions of the three-dimensional
space obtained b[...]
items represented h[...]

It is possible, therefore, to assign a score for each type of error.


TABLE 3

Example of Joint Distributions of Answers to Four Distractors
in Several Items, in Percentages

                          Item i                 Item j                 Item k
Item  Distractor     a    b    c    d      a    b    c    d      a    b    c    d
i     a            100                    40   27   34   16     34   11   32  2[?]
      b                 100               26   41   31   16     31   38   22   16
      c                      100          14   17   10   15     07   10   19   10
      d                           100     20   15   25   53    [?]   41   27  [?]
      Total        100  100  100  100    100  100  100  100    100  100  100  100
j     a             23   15   16   09    100                    19   10   12   06
      b             18   28   23   08         100               11   21   13   04
      c             36   32   19   21              100          34   26   35   20
      d             23   25   42   62                   100     36   43   40   70
      Total        100  100  100  100    100  100  100  100    100  100  100  100
k     a             30   29   13   19     39  [?] 2[?]   20    100
      b            [?]   21   13   17     11    5   13   14         100
      c             49   37   58   33     39   42   43   32              100
      d              5   13   16  [?]    [?]   07   15   34                   100
      Total        100  100  100  100    100  100  100  100    100  100  100  100


Figure 3. The first two dimensions of the three-dimensional space obtained
by SSA-II for distractors of eleven arithmetic-test questions. Questions are on 
principal (P), rate (R), or yield (Y). Points relatively high on the third
dimension are indicated by an asterisk.


MODERATING HETEROSCEDASTICITY


EDWIN E. GHISELLI AND ERIC P. SANDERS
University of California, Berkeley


One of the basic postulates of classic psychometric theory is that
the magnitude of the error with which a test measures is the same 
for all individuals. This is held to be true whether reliability or 
validity of measurement is being considered. In the case of re- 
liability, on any one given administration of a test the error scores 
are taken to vary in magnitude from one individual to another. 
For some individuals on this administration of the test the error 
of measurement is large, whereas for others it is small. However, 
over many parallel tests the standard deviation of error scores is 
taken to be precisely the same size for all individuals. More
exactly, it would be said that as the number of parallel tests
increases without limit the standard deviation of every individual's
error scores, the standard error of measurement, approaches the 
same value for all individuals. Consequently a given test is con- 
sidered as measuring all individuals with the same degree of re- 
liability. In a similar fashion the errors of prediction are taken to 
be of the same magnitude for all individuals. On any one ad- 
ministration of a given predictor and criterion, the error with 
which the former predicts the latter varies in magnitude from one
individual to another. Consequently on a particular administration
of the test and the criterion, for some individuals the error
of prediction is large whereas for others it is small. However, over
many parallel predictors and criteria, the standard deviation of
errors is taken to be the same size for all individuals. Again it
would be more proper to say that as the number of parallel
criteria and tests increase without limit the standard deviation of
every individual's error scores, the standard error of prediction,
approaches the same value for all individuals. Indeed, granting
certain basic assumptions about the nature of true and error scores, 
the validity of these two propositions seemingly can be demon- 
strated mathematically (Ghiselli, 1964; Gulliksen, 1950). 

To put it another way, classic psychometric theory holds that 
the correlation between any two tests, as for example two parallel 
tests or a predictor and a criterion, is necessarily homoscedastic. 
Indeed, whether one does or does not hold to classic psychometric 
theory, the common belief is that errors of measurement and of 
prediction are the same in magnitude for all individuals, and 
homoscedasticity is assumed to hold in all relationships. 

There have been a number of demonstrations that the degree of 
relationship between two tests may systematically vary for differ- 
ent subgroups drawn from a total group of individuals (Ghiselli, 
1963). That is, on the basis of a variable, termed a moderator 
variable, the individuals who form the total group can be sorted 
out into a series of two or more subgroups in which the degree of 
relationship between two tests varies systematically from subgroup
to subgroup. The subgroups may, of course, be considered
as being infinite in number, and hence each as occupying merely
a point on the moderator continuum rather than a broad zone. In
such a case individual differences in the moderator variable would 
be thought of as being related to the degree of relationship be- 
tween two tests. In other words, scores on the moderator variable 
would be correlated with the degree of error of measurement or 
error of prediction. For individuals who fall at one extreme on 
the moderator variable errors of measurement or prediction are 
large, whereas for those who fall at the other extreme the errors 
are small.

Examinations of the effects of moderator variables so far have 
been directed solely at the variation in the degree of correlation 
between two tests for different subgroups. While no statements are 
made about homoscedasticity of relationship, [...]
would seem to be that while the degree of correlation [...]
tests varies among the moderated subgroups, [...]
relationship between the two tests is homoscedastic [...]
that in each subgroup the error of measurement [...]
would be taken to be of exactly the same [...] individuals.

Now suppose it were possible to demonstrate that by means of
a moderator variable a group of individuals could be segregated 
into subgroups which differed in heteroscedasticity. For example, 
suppose one could show that a moderator variable can divide a
total group into two subgroups in one of which high scores on the 
two tests being related are closely associated and low scores are 
only slightly associated; whereas for the other group the reverse 
is true, with high scores being only slightly associated and low 
scores closely associated. If this could be demonstrated then one 
could conclude that errors of measurement and of prediction are 
not necessarily the same for all individuals. Furthermore, such 
findings would indicate that any given relationship, even if it be 
homoscedastic for a total group, can be thought of as a combination
of a series of subgroups which differ in heteroscedasticity.

If two tests are perfectly related then each individual’s standard
score on one test is precisely equal to his standard score on the
other test. To put it another way, when two tests are perfectly 
related, the differences between the two standard scores for all 
individuals are zero. On the other hand if there is no relationship 
at all between two tests, then on the average the difference be- 
tween the two standard scores is large. So the average of the 
differences between the two standard scores for individuals is an 
index of the degree of relationship between the two variables. 
When the relationship is high the average difference between pairs 
of standard scores is small, and when the relationship is low then 
on the average the difference between the two standard scores is 
large. If the relationship between two tests is heteroscedastic, be- 
ing shaped say like a pear, then at one end of the bivariate fre- 
quency distribution the average of the differences between indi- 
viduals’ two standard scores is small, whereas at the other extreme 
of the bivariate frequency distribution the average of the differ- 
ences between the two standard scores is large. 
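
The index described in this paragraph can be sketched as follows. This is an illustrative sketch on made-up data, not the authors' computation: the helper names (`standardize`, `mean_abs_difference`, `pearson_r`) and the simulated scores are assumptions, and population standard deviations are used for standardizing.

```python
import random
import statistics


def standardize(scores):
    """Return standard scores (mean 0, population standard deviation 1)."""
    mean = statistics.fmean(scores)
    sd = statistics.pstdev(scores)
    return [(s - mean) / sd for s in scores]


def mean_abs_difference(x, y):
    """Average absolute difference between paired standard scores."""
    zx, zy = standardize(x), standardize(y)
    return statistics.fmean([abs(a - b) for a, b in zip(zx, zy)])


def pearson_r(x, y):
    """Pearson correlation as the mean product of standard scores."""
    zx, zy = standardize(x), standardize(y)
    return statistics.fmean([a * b for a, b in zip(zx, zy)])


random.seed(0)  # made-up data, for illustration only
x = [random.gauss(0, 1) for _ in range(500)]
y_close = [v + random.gauss(0, 0.3) for v in x]     # strongly related to x
y_loose = [random.gauss(0, 1) for _ in range(500)]  # unrelated to x

d_close = mean_abs_difference(x, y_close)
d_loose = mean_abs_difference(x, y_loose)
```

In this sketch the strongly related pair gives the smaller average difference. For squared rather than absolute differences the link to the correlation is exact: with population standard deviations, the mean of $(z_x - z_y)^2$ equals $2(1 - r)$.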

When the relationship between the two tests is positive, and 
heteroscedastically pear-shaped, the bivariate frequency distribu- 
tion would appear either as an upright pear, with the small end of 
the pear pointing to the upper right-hand corner of the
scatterdiagram and the large end to the lower left corner, or as an
upsidedown pear with the small end pointing to the lower left-hand
corner of the scatterdiagram and the large end to the upper
right-hand corner. This is illustrated in Figure 1.


Figure 1. Two heteroscedastic relationships pear shaped in opposite directions.


Let the scatterdiagram be divided into two areas by running a
line perpendicular to the line of relationships through the point
representing the two means, as is shown in Figure 1. Thus the 
scatterdiagram is divided into two right triangles, one on the upper 
right-hand portion of the scatterdiagram which contains those in- 
dividuals whose scores are high on both tests, and the other tri- 
angle at the lower left-hand portion of the scatterdiagram which 
contains those individuals whose scores are low on both tests. If
the relationship were homoscedastic, then the average of the
differences between the two standard scores for all individuals falling
in the upper triangle would be precisely the same as the average 
of the differences between the standard scores of those individuals 
in the lower triangle. On the other hand, if the relationship were 
heteroscedastic, being shaped like a pear, then the individuals in
one triangle would show a small difference between their standard
scores, whereas those falling in the other triangle would show a 
large difference. The relationships depicted in Figure 1 show this. 
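
The comparison of the two triangles can be sketched as below. One assumption is made loudly here: the dividing line is taken to be $z_x + z_y = 0$, that is, the perpendicular through the means to the 45-degree line $z_y = z_x$; the source describes the line only verbally. The function name `triangle_averages` and the eight data points are hypothetical.

```python
import statistics


def standardize(scores):
    """Return standard scores (mean 0, population standard deviation 1)."""
    mean = statistics.fmean(scores)
    sd = statistics.pstdev(scores)
    return [(s - mean) / sd for s in scores]


def triangle_averages(x, y):
    """Mean |z_x - z_y| in the upper-right and lower-left triangles.

    Assumed dividing line: z_x + z_y = 0. Cases lying exactly on the
    line are placed in the lower triangle (an arbitrary choice).
    """
    zx, zy = standardize(x), standardize(y)
    upper = [abs(a - b) for a, b in zip(zx, zy) if a + b > 0]
    lower = [abs(a - b) for a, b in zip(zx, zy) if a + b <= 0]
    return statistics.fmean(upper), statistics.fmean(lower)


# Made-up scores: tight agreement at the high end, loose at the low end.
x = [-2, -2, -1, -1, 1, 1, 2, 2]
y = [-1, -3, 0, -2, 1, 1, 2, 2]
upper_avg, lower_avg = triangle_averages(x, y)
```

For these made-up scores the upper triangle shows the smaller average difference, which corresponds to the upright pear of Figure 1. A homoscedastic relationship would give roughly equal averages.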

Suppose a variable could moderate homoscedasticity. Then if the 
total group were divided into say two subgroups on the basis of a 
moderator, those individuals with high moderator scores and those 
with low moderator scores, the two subgroups would show opposite 
patterns in heteroscedasticity. That is, for those with high mod- 
erator scores the cases which fall in the upper triangle would show 
small differences between their pairs of standard scores, whereas 
those in the lower triangle would show on the average large differ- 
ences. The reverse would be true with those who have low mod- 
erator scores. For them, those individuals who fall in the upper 
triangle would show large differences between their standard scores 
on the average, whereas those who fall in the lower triangle would 
show small differences. 

The task of the present investigation is to ascertain whether 
it is possible to develop a moderator which will divide a total 
group into two subgroups in which the relationship between two 
tests shows opposite heteroscedasticity in the two subgroups. The 
question is, is it possible to find a variable which will divide a
total group into two subgroups in one of which the average of the 
differences between the standard scores of the two tests being re- 
lated is small for the individuals in the upper triangle in the 
scatterdiagram and large for those individuals in the lower tri- 
angle, whereas the reverse is true for the other subgroup? 


Methods and Procedures 


The basic plan of the investigations to be reported here is as 
follows. The scores of a group of individuals are obtained upon 
two tests and an inventory. From the inventory a scale is to be
developed which serves to moderate the relationship between the 
two tests. The total group of individuals is divided randomly into 
two halves, one half forms the experimental group, and the other 
half a cross-validation group. Within the experimental group, sub- 
jects are divided into those whose scores on the two tests place 
them in the upper right-hand triangle of the bivariate frequency
distribution, and those whose scores place them in the lower
left-hand triangle (see Figure 1). For each subject the difference,
regardless of sign, between his standard scores on the two tests is
calculated. In each of the two triangles the subjects are divided as 
nearly as possible into two halves, that half for which the differ- 
ences between the two standard scores regardless of sign are 
smallest, and that half for which the differences are largest. We
have, then, four subgroups of the original experimental group: those
in the upper triangle for whom the average of the differences
between the standard scores is small, and those for whom it is
large; and similarly for those cases who fall in the lower triangle,
those for whom the differences in the standard scores are small,
and those for whom they are large.

From these four groups two experimental groups are formed. 
The first group consists of those individuals who fall in the upper 
triangle for whom the differences in standard scores are small 
plus those who fall in the lower triangle for whom the differences 
in the standard scores are large. These individuals represent the 
upright pear type of heteroscedastic relationship. The second group 
is comprised of those individuals who fall in the upper triangle 
for whom the differences in standard scores are large plus those 
who fall in the lower left-hand triangle for whom the differences 
in the standard scores are small. These individuals form an up- 
sidedown pear type of bivariate distribution. 
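
The formation of the two experimental groups can be sketched as follows. This is a sketch under stated assumptions, not the authors' procedure in detail: the triangle rule (sign of $z_x + z_y$), the handling of ties and odd counts, and the function name `pear_groups` are all hypothetical.

```python
def pear_groups(zx, zy):
    """Split subjects into 'upright' and 'upsidedown' pear groups.

    zx, zy: standard scores on the two tests, one pair per subject.
    Returns two sorted lists of subject indices.
    """
    n = len(zx)
    upper = [i for i in range(n) if zx[i] + zy[i] > 0]   # high on both tests
    lower = [i for i in range(n) if zx[i] + zy[i] <= 0]  # low on both tests

    def halves(indices):
        # Order by absolute difference; the first half has the smallest.
        ordered = sorted(indices, key=lambda i: abs(zx[i] - zy[i]))
        cut = len(ordered) // 2
        return ordered[:cut], ordered[cut:]

    upper_small, upper_large = halves(upper)
    lower_small, lower_large = halves(lower)
    # Upright pear: tight in the upper triangle, loose in the lower one.
    upright = sorted(upper_small + lower_large)
    # Upsidedown pear: loose in the upper triangle, tight in the lower one.
    upsidedown = sorted(upper_large + lower_small)
    return upright, upsidedown


# Made-up standard scores for eight subjects.
zx = [1.0, 1.5, 0.5, 2.0, -1.0, -1.5, -0.5, -2.0]
zy = [1.1, 0.5, 0.6, 1.0, -0.2, -1.4, -1.5, -1.9]
upright, upsidedown = pear_groups(zx, zy)
```

The two index lists would then serve as the criterion groups for the item analysis of the inventory described in the next paragraph.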

Using these two experimental groups an item analysis is per- 
formed on the forced-choice inventory, seeking items which sig- 
nificantly differentiate the two experimental groups. From these 
differentiating items a scale is formed which constitutes the mod- 
erator variable. If the moderator variable works properly, then 
those who obtain high moderator scores would form a frequency 
distribution which resembles the upright pear, whereas those who 
earn the low moderator scores would form a bivariate frequency 
distribution similar to an upsidedown pear. 

The next step is to apply the moderator scale to the cross- 
validation group. As nearly as possible the cross-validation group 
is divided into those individuals earning the upper half of mod- 
erator scores, and those individuals earning the lower half of mod- 
erator scores. Each of these two cross-validation groups is further 
divided into those who on the basis of their scores on the two tests 
fall in the upper triangle of the bivariate frequency distribution 
and those who fall in the lower triangle.

If the moderator variable operates as it should, then the first
group should form an upright pear-shaped bivariate frequency dis- 
tribution, and the second group should form an upsidedown pear. 
In the first group, those earning high moderator scores, the
individuals falling in the upper triangle should have a smaller
average difference between their standard scores on the two tests
than do those falling in the lower triangle. In addition, in the 
second group, those earning low moderator scores, the individuals 
falling in the upper triangle should have a larger average differ- 
ence between their standard scores on the two tests than do those 
falling in the lower triangle. 

This plan was applied to three groups of individuals. Group A 
consisted of men who held managerial positions in some business 
or industrial establishment. The subjects were randomly divided 
into two groups of 155 each, one of which served as the experi- 
mental group and the other as the cross-validation group. The 
two tests to be moderated were an intelligence test and a test of
perceived occupational level. The relationship between the scores 
on these two tests was moderately high, being .62. The inventory 
from which a moderator variable was to be developed was a 64-
item forced-choice inventory in which the individual chooses be- 
tween two self-descriptive adjectives by indicating the one he be- 
lieves more characterizes him. The number of items which signifi- 
cantly differentiated between the two experimental groups, those 
individuals forming an upright pear and those an upsidedown pear, 
was 10. 

Group B consisted of 368 men who formed an approximate cross- 
section of the population of employed men, half of whom formed 
the experimental group and half the cross-validation group. The 
two tests given them were an intelligence test and a test of
supervisory ability. The correlation between these two tests was lower
than that between the two tests for Group A, the coefficient being
.37. The inventory which was used to develop a moderator for
Group A was also used with this group. Again 10 items were
found to differentiate between the two experimental groups, but
these items were different from those found to be significant for
Group A.

The subjects in Group C also were men in managerial positions.
In this case 236 of them formed the experimental group and a like
number the cross-validation group. The two tests to be moderated
were an intelligence test and ratings of success in managerial posi- 
tions. These two variables were only slightly related, the coefficient 
of correlation between the two sets of scores being only .16. The 
inventory used to develop a moderator variable again was a forced- 
choice inventory. It consisted of 156 items and included not only 
personally descriptive items but also items bearing upon
occupational, educational, and avocational interests. Of these items, 15
were found to significantly differentiate between the two
experimental groups.


Results 


The effects of the moderator variables applied to the cross-
validation groups are shown in Table 1. If the moderator variable
[...]


TABLE 1

Average Differences between Standard Scores ($|z_x - z_y|$) for Individuals
with High and Low Moderator Scores

                 High Moderator Scores               Low Moderator Scores
                    (Upright Pear)                    (Upsidedown Pear)
            Upper Right-      Lower Left-       Upper Right-      Lower Left-
            Hand Triangle     Hand Triangle     Hand Triangle     Hand Triangle
Group       |zx - zy|   N     |zx - zy|   N     |zx - zy|   N     |zx - zy|   N
A             .82      58      [?]33     19       .45      33      [?]33     44
B            [?]42     95      [?]10     68      [?]12     62      [?]76     29
C            1.00      74      1.15      65      1.13      52       .83     [?]
Average       .60               .78               .67              [?]


[...] are rather small, in one case being di[...]
decimal point. However, [...]
possible to moderate homoscedasticity for the tests applied to these
three groups of individuals. For each group it proved possible to 
separate them into two subgroups which showed heteroscedasticity 
of opposite sorts. 


Conclusions 


From these data, it appears that the implicit assumption of 
classic psychometric theory, that relationships are homoscedastic, 
is open to question. If the assumption is correct, then the standard 
errors of measurement and prediction should be the same for all 
individuals. However, for three instances described here, the data 
indicate that this is not the case. It was demonstrated that a
moderator could be developed which differentiated two subgroups 
displaying different patterns of heteroscedasticity. For individuals 
who scored high on the moderator, those with high scores on both 
tests, on the average, exhibited smaller differences between their 
pairs of standard scores, whereas those who scored low on both
tests, exhibited larger differences in their standard scores. The re- 
verse was found to be true for those individuals who scored low on 
the moderator. 

For two of the three groups utilized in this study, the same 
inventory was used to develop a moderator. However, neither group 
was moderated by the same set of items. This reemphasizes the 
point made by Saunders (1956) and Ghiselli (1963) that it is 
patent that moderator scores are specific to the tests in question 
and to the situation itself, and that their relative merits must be 
looked at in the light of each specific problem. 

An important implication of this study is in regard to sequential 
testing or sequential prediction. Specifically, if one uses a predictor
($P_a$) and if an individual ($I_1$) scores high on a moderator, and
another individual ($I_2$) scores low on the same moderator, it can be
asserted that the prediction using $P_a$ is good for $I_1$ but poor for
$I_2$. If another predictor $P_b$ is used and $I_1$ scores low on another
moderator and $I_2$ scores high on that same moderator, one can say that
the prediction using $P_b$ is good for $I_1$ but not for $I_2$. Then suppose
that a third predictor $P_c$ is applied with still another moderator, and
$I_1$ scores in the middle of the distribution and $I_2$ also scores in the
same place. Here it can be stated that the prediction using $P_c$ is not
too good for either $I_1$ or $I_2$. Taking these three predictors ($P_a$,
$P_b$, $P_c$) and these two individuals ($I_1$ and $I_2$) it can be
observed that $P_a$ and $P_b$ are good predictors for $I_1$ but not for
$I_2$, and that $P_c$ is not a good predictor for either $I_1$ or $I_2$.

This brief illustration emphasizes the salient point that if one 
begins with the assumption of a homoscedastic distribution of mod- 
erate predictability, but a moderator identifies differentially 
heteroscedastic subgroups with greater predictability, then one can 
predict more accurately by using a prediction system based on the 
assumption of a heteroscedastic distribution. Rather than starting 
to classify people on the basis of some prediction system that 
makes specific a priori assumptions regarding population distribu- 
tions, it would be well to first seek a moderator variable which 
would differentiate various subgroups (i.e., heteroscedastic) and
then apply a specific prediction system for those individuals within 
a differentiated subgroup distribution.


A NOVEL APPLICATION OF THE COEFFICIENT OF
CORRELATION IN THE ISOLATION OF BOTH TYPAL 
AND DIMENSIONAL CONSTRUCTS 


LOUIS L. McQUITTY 
Michigan State University 


One way to judge the appropriateness of a statistical technique
to the analysis of data is to consider it in relation to the theory 
being investigated; the technique should be internally consistent 
with the theory. 

The approach of this paper revises the general application of 
the coefficient of correlation in order to render it appropriate to a
particular theory of types. 

A type is here defined as a category of individuals with a unique 
pattern of characteristics; everyone in the category possesses all 
of the characteristics, and anyone not in the category does not 
possess all of the characteristics of the pattern. 

The logic of the development of the revision is applied to show 
that the revised application of the coefficients of correlation can 
prove to be more appropriate than are the customary approaches. 
The logic of the revision together with the results from revised 
applications to certain kinds of data can place in question some of 
the customary applications and the interpretations which generally 
derive therefrom. The revision expands the possibility of new ap- 
plications for other statistics. 

The revision facilitates the discovery of the kinds of data to 
which the method is uniquely appropriate if these classes of data 
do in fact exist. Even if these kinds of data do not exist, the 
method shows promise with respect to customary data, and it
suggests and develops novel and promising applications of other
statistics to customary data.


Development of the Method


Consider a matrix which reports a coefficient of correlation of 
every individual of a sample with every other individual of the 
sample. 

Let $r_{ij}$ represent any coefficient of correlation in the matrix
between any two individuals, i and j. Let the numbers 1, 2, 3, ..., i,
j, ..., n refer to the n individuals of the study.

The matrix usually provides an opportunity to obtain several 
other indices of the covariance between i and j, which can be 
shown to have considerable promise. 


In developing a promising approach to another index of covariance,
we can study the degree of covariance in the corresponding
$r_{ki}$'s and $r_{kj}$'s as we read down the two
Columns i and j of a matrix. More specifically, we can compute the
Pearsonian coefficient of correlation between i and j by using the
paired r's of Columns i and j as we read down Columns i and j;
every row, such as any Row k, produces a pair of values, $r_{ki}$ and
$r_{kj}$. This index can be called an Intercolumnar Coefficient of
Correlation and designated $I^1_{r_{ij}}$ or more simply $I^1_{ij}$.
Before computing the $I^1_{ij}$'s, plus one is entered in the diagonal
cells (those reporting the correlation of an individual with himself).

When the above approach is applied to every column with every
other column, it produces a new matrix $I'$. The process can be
repeated again and again to produce $I''$ of $I_{ij}''$'s, $I'''$ of
$I_{ij}'''$'s, and so on to $I^N$ of $I_{ij}^N$'s, where $I', I'', I''',
\ldots, I^N$ refer to the successive matrices and the $I_{ij}'$'s,
$I_{ij}''$'s, $I_{ij}'''$'s, $\ldots$, $I_{ij}^N$'s refer to the entries within
each successive matrix; $N$ is the number of applications required
to stabilize the matrix; $I_{ij}^N = I_{ij}^{N+1}$, except for differences so
small that they can be ignored. The first application of the method
generally ends by isolating one or more submatrices (generally more
than one) in which all coefficients of every submatrix are plus one
(or very nearly so). Each submatrix defines a statistical type in
which every individual is identical with every other individual, i.e.,
in terms of the characteristics which they have in common as isolated
by the method. Each submatrix has its unique pattern of common
characteristics; no individual of the sample and outside of the
submatrix has the complete pattern of characteristics represented by
the submatrix.
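The repeated column-by-column correlation described above can be sketched in a few lines. This is only an illustration under stated assumptions: the function names, the stopping tolerance, and the small four-person matrix are hypothetical and are not taken from the paper.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def intercolumnar(matrix):
    """One application: correlate every column with every other column.

    The input is assumed to carry plus one in its diagonal cells; the
    output diagonal is set to plus one again for the next application.
    """
    n = len(matrix)
    cols = [[matrix[k][j] for k in range(n)] for j in range(n)]
    return [[1.0 if i == j else pearson(cols[i], cols[j]) for j in range(n)]
            for i in range(n)]


def iterate_until_stable(r, tol=1e-9, max_iter=100):
    """Repeat the application until no entry changes by more than tol."""
    current = [row[:] for row in r]
    n = len(current)
    for step in range(1, max_iter + 1):
        nxt = intercolumnar(current)
        change = max(abs(nxt[i][j] - current[i][j])
                     for i in range(n) for j in range(n))
        current = nxt
        if change < tol:
            return current, step
    return current, max_iter


# Hypothetical correlations among four individuals (illustrative only).
R = [[1.0, 0.8, -0.3, -0.4],
     [0.8, 1.0, -0.2, -0.5],
     [-0.3, -0.2, 1.0, 0.7],
     [-0.4, -0.5, 0.7, 1.0]]

stable, applications = iterate_until_stable(R)
```

For this illustrative matrix the entries settle near plus one for the pairs (0, 1) and (2, 3) and near minus one across the two pairs, which is the kind of submatrix structure the text describes.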
Characteristics which are eliminated as error variance at one
level of classification are redefined and retained at some lower 
level of classification. Once the first matrix has yielded one or 
more submatrices, each with near plus one entries only, the method 
is repeated. In the repetitions, the method is applied successively 
to each submatrix. Before the submatrices are analyzed, the unities 
(or near unities) in their cells are replaced by the corresponding 
r’s of the original matrix. Analysis of the first set of submatrices 
produces a second set of submatrices and thus a second level of 
classification. The method proceeds in this general fashion to di- 
vide and redivide until every one is usually in a category of just 
one individual at the bottom level of classification; two or more 
individuals would be in a single category at the bottom of the 
classification if and only if they agree in their answers to each of 
all of the items. 
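The divide-and-redivide procedure can be sketched as a recursion. This is a minimal sketch under assumptions: grouping individuals whose stabilized coefficient with a group's first member is within `eps` of plus one, treating a zero-variance column as "no basis for a split", and all names and the example matrix are hypothetical choices rather than details given in the paper.

```python
from math import sqrt


def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)  # ZeroDivisionError if a column is constant


def stabilize(r, tol=1e-9, max_iter=100):
    """Apply intercolumnar correlation repeatedly until entries settle."""
    cur = [row[:] for row in r]
    n = len(cur)
    for _ in range(max_iter):
        cols = list(zip(*cur))
        nxt = [[1.0 if i == j else pearson(cols[i], cols[j])
                for j in range(n)] for i in range(n)]
        change = max(abs(nxt[i][j] - cur[i][j])
                     for i in range(n) for j in range(n))
        cur = nxt
        if change < tol:
            break
    return cur


def classify(r, members=None, eps=1e-6):
    """Return a nested list: each level splits a category into subcategories."""
    if members is None:
        members = list(range(len(r)))
    if len(members) < 2:
        return members
    # Restore the ORIGINAL r's for this submatrix before re-analysis.
    sub = [[r[i][j] for j in members] for i in members]
    try:
        final = stabilize(sub)
    except ZeroDivisionError:
        return members  # identical individuals: keep them in one category
    groups, unassigned = [], list(range(len(members)))
    while unassigned:
        head = unassigned[0]
        group = [k for k in unassigned if final[head][k] > 1.0 - eps]
        unassigned = [k for k in unassigned if k not in group]
        groups.append([members[k] for k in group])
    if len(groups) < 2:
        return members
    return [classify(r, g, eps) for g in groups]


# Hypothetical correlations among four individuals (illustrative only).
R = [[1.0, 0.8, -0.3, -0.4],
     [0.8, 1.0, -0.2, -0.5],
     [-0.3, -0.2, 1.0, 0.7],
     [-0.4, -0.5, 0.7, 1.0]]

tree = classify(R)
```

With the illustrative matrix, the first level separates individuals 0 and 1 from individuals 2 and 3, and the second level places each individual in a category of one.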

Every individual is classified in one and only one category at 
every level of classification. 

The hierarchical classification is built from the top down with 
every analysis using all of the coefficients of the matrix or sub- 
matrix to which it is applied. 

If we pair every submatrix of any one level of classification 
with every other submatrix of that level of classification, we find 
either one or two categories of pairs. In one category of pairs, the 
unique pattern of one matrix is completely independent, in one 
sense, of the unique pattern of the other; there are no character- 
istics common to the two patterns. In another sense the patterns 
are opposite; the $I_{ij}^N$ between any two members (with one from
each of the two submatrices) will be minus one; for every
characteristic on which one submatrix is positive in having the
characteristic represented in its pattern, the other submatrix is
negative in not having it represented.

In the other pairs of submatrices, the similarity between two 
patterns may vary between, but not including, the extremes of 
plus one and minus one. In one extreme, the two patterns may be
very similar in having all but one characteristic in common. In
the other extreme, each pattern may be opposite to the other
pattern except for one characteristic which they have in common.

Intercolumnar Correlational Analysis was developed by the
author out of an effort to improve the reliability and validity of
coefficients of correlation (McQuitty, 1968). Illustrations with both
theoretical and empirical data and mathematical developments 
and proofs are to be reported in a later article by McQuitty and 
Clark (1968). 


An Interpretation of Typical Results 


Return now to the original matrix of the usual kind of coefficients
of correlation between the individuals of a sample of subjects; as
before, let $r_{ij}$ represent any coefficient of correlation between any
two individuals, $i$ and $j$, of the sample of subjects. Let $\bar{r}_{ij}$
represent the true correlation between $i$ and $j$, or the mean
correlation between $i$ and $j$ if it were computed on such a large number
of samples that a repetition of the approach would produce a negligible
change in $\bar{r}_{ij}$.

Then:

$$r_{ij} = \bar{r}_{ij} + e_{ij},$$

where $e_{ij}$ is defined as the amount of error between $r_{ij}$ and
$\bar{r}_{ij}$.

Analogously, let $I_{ij}' = \bar{I}_{ij}' + E_{ij}'$. Let $I_{ij}^N$ = any
Intercolumnar Correlation of any submatrix of the final analysis, where
$I_{ij}^N = 1$, or 1 minus a negligible fraction. Then:

$$I_{ij}^N = 1 \text{ (or nearly so)} > I_{ij}^{N-1} > \cdots > I_{ij}'' > I_{ij}' > r_{ij},$$

except for some few $I$'s with small superscripts which first move
in the opposite direction to their final trend. Consequently:

$$E_{ij}^N = 0 \text{ (or nearly so)} < E_{ij}^{N-1} < \cdots < E_{ij}'' < E_{ij}' < e_{ij},$$

with the analogous exception as specified above.

In other words, by continuing to compute and re-compute the
covariance between $i$ and $j$ in terms of successive $I_{ij}$'s, we
gradually redefine $i$ and $j$ in terms of fewer characteristics and
thereby eliminate all error variance.


A Definition of Error Variance


The method is rather uniquely appropriate to data which reflects
interaction variance; in this kind of data, test responses are
interrelated differently in different categories of people
(statistical types).

If on the other hand the data reflect only linear relations then the 
method has no unique appropriateness to the data. 

The method is unique in another way. It is reasonably applicable 
across all kinds of data; those which include only linear relations 
as well as those which include all other kinds such as curvilinear 
and disjunctive relationships. 

The general appropriateness of the approach is a function of 
the fact that its definition of error variance derives from the inter- 
action of the method with the data. If only a linear continuum 
exists in the data, the analysis of the original matrix will define 
error variance in terms of non-conformity to two points on this 
continuum, each category of subjects defining a point on the con- 
tinuum. 

An example of linearity at any level of classification is two
submatrices in which the $I_{ij}^N$ for all pairs of individuals, with one
from each submatrix, is minus one. One category of persons is toward one
end of a continuum and the other category is toward the other end
of the continuum.
Characteristics described as chance in terms of the first two
points are introduced again along with all other characteristics
in the analysis for the next level of classification, i.e., the analysis
of the two submatrices which defined the first two points. If the
data continue to conform to a single continuum, the second level
of classification will spread each of the two original points into
two separate points along the same continuum. This general
approach is repeated again for lower and lower levels of classification
until every individual has a unique ordinal position on a common
continuum, except for those few, if any, individuals who agreed
in their answers; they will have a position in common to them.
These points are correct if and only if the data reflect one and
only one continuum, and no other configuration.


A Complex of Continua


The method can isolate a single continuum as represented by a
straight line, or a complex of continua, represented by several
straight lines, as illustrated in Figure 1.
Figure 1. A hypothetical complex of linear continua. The numbers
represent the points resulting from the several levels of
classification from one through three. The distances between points
at the ends of lines are unknown, and the sizes of the angles are
unknown.


In Figure 1, each set of two points represents two categories of
subjects, isolated by each major step in the analysis; Points 1 and 
1 represent the two categories of subjects derived from analyzing 
the original matrix. Points 2, 2, 2, and 2 represent the four cate- 
gories realized by analyzing the two submatrices pertaining to the 
first two categories of subjects. Points 3, 3, 3, 3, 3, 3, 3, and 3 
represent the categories realized by analyzing the four submatrices 
pertaining to the four categories represented by the number 2. 

Each pair of points (at the end of a line) can be accepted as 
categories for testing the hypothesis that they define a psycho- 
logical continuum. One technique can use any one of the many 
approaches in item analysis, designed to select the items which 
differentiate between two criterion categories of subjects. Those 
items significant at a stated level can then be used to allocate 
subjects to the continuum. More refined methods can be used in an
effort to assign the items (or a selected group of the items) to the
continuum and then the individuals in terms of their responses to
the items,


LOUIS L. McQUITTY 597 


ysis has been effective in isolating a continuum or continua as 
the case may be.

One of the unique values of a classification system, such as 
Intercolumnar Correlational Analysis, is that the error variance 
grows out of the interaction between the method and the data; 
error variance is defined in terms of interrelationships in the data. 
Relationships are isolated rather than superimposed and purified. 
In this sense numerical classification of data is more fundamental 
than measurement. It should often come first to see if the assump- 
tion of dimensions is justified. If it is, then linear methods can be 
applied to test dimensionality more exactingly. 

If all of the straight lines in Figure 1 were substantiated to 
represent psychological continua, then the resulting scales and 
usual correlational methods could be applied to determine the
angles between them; the angles might all be either zero or 180°, 
the several continua thus collapsing into but one continuum. In 
this case, however, the single continuum might well be found earlier 
in item-analyzing between the two points represented by ones, pro- 
vided all data is relatively free from chance influence. 

Other possible outcomes are that the data might be found to 
represent two or more continua, or they might be found to
represent types defined by configural relations. Still another possibility
is that some responses might define continua and others might 
define configural types. 


The Special Case of Configural Relationships 


Suppose that the method has isolated statistical types which are 
rich in configural relationships (interaction variance is a general 
characteristic of the data) and that the statistical types have been 
substantiated by cross-validational studies. 

Suppose more specifically that the method has isolated the three
types of individuals defined by the dots and circles listed in Figure 
2. Each circle represents a statistical type and each dot in a circle 
represents a characteristic. For purposes of identification, the cir- 
cles are coded one, two, and three. Each dot has one, two, or no 
numbers from 1 to 3 listed beside it. If a dot is unnumbered in 
a circle, that characteristic does not appear in either of the other 
two types. If, on the other hand, a dot has one or two numbers
beside it, it appears also in the type or types denoted by the one
or two other code numbers.

Figure 2. Hypothetical types with configural relationships included.
(Three circles, labeled Type 1, Type 2, and Type 3.)

If such types as represented in Figure 2 are substantiated, then 
two generally used statistical approaches are inappropriate for this 
kind of data. We cannot, for example, measure the psychological 
distance between any two types, nor any two characteristics from 
two or more types. The same objective characteristic, such as a 
“yes” answer to Item 1, is not the same psychological answer in 
both types 1 and 2, because it is part of two different
configurations. The meaning, significance, and predictive ability of the
answer is a function of the other answers with which it occurs, as
illustrated forcibly by the Meehl paradox (1950).

Analogously, the correlation between two items over two or more 
types is psychologically meaningless because the items have different
configurations of responses in the several types. This does not
mean that statistical methods cannot be developed which can start 
with correlation indices, even in such types, and convert the ap- 
proach to statistics which are psychologically meaningful; such has 
been accomplished here in Intercolumnar Correlational Analysis. 
Summary: A Cooperative Use of the Novel and Usual Applications


Fundamental themes of this paper are that method should be 
adapted to theory, and that competing theories and their
corresponding methods should cooperate in efforts to advance knowledge.

Theory defines error variance. Error variance exists in reality 
as a deviation from theoretical structures. Error free concepts and 
structures exist in theory only. Our purpose is to develop those 
concepts and structures which fit reality with a minimal amount of 
error variance required to explain deviations in reality from the 
concepts and structures of theory. 

Joint application of both linear theories of personality structure 
and configural theories, each in cooperation with its appropriate 
methods and data will prove more fruitful in the long run than 
either theory exclusively. Dimensional theories have already 
yielded an abundance of useful psychological continua. Joint ap- 
plications of both theories will enable us to determine in the long 
run both the extent of the two kinds of relationships and the 
situations in which they occur.

Even if the hypotheses which derive from a theory of types are
never substantiated and we eventually drop the theory, the method 
can still prove helpful not only in freeing us from the theory but 
also in classifying data; the methods of classification derived from 
the theory are not inconsistent with dimensional theories. Classifi- 
cation of data serves to isolate hidden dimensions so that we can 
give more psychological meaning to them and thus isolate them 
still more clearly by the subsequent application of dimensional
methods.
ESTIMATING TEST NORMS FROM VARIABLE SIZE
ITEM AND EXAMINEE SAMPLES¹


DESMOND L. COOK AND DANIEL L. STUFFLEBEAM
The Ohio State University


One of the major problems in test development is that of secur-
ing representative norms. This problem becomes somewhat acute 
when one is seeking the cooperation of schools in the administra- 
tion of a test which requires a considerable amount of student 
time to complete. Many schools are not willing to cooperate with 
test developers when such time demands are made; hence there is
some question about the representativeness of the norms. One pos- 
sible solution to this problem would be to give a smaller number 
of items or to use smaller numbers of students provided that the 
norm performance on the longer test for the larger group could 
be adequately estimated using the shorter test or the smaller num- 
ber of subjects. 

Lord (1962) has described a procedure for estimating test norm
distributions by use of the item sampling technique. The specific
purpose of his study was to determine if reliable estimates of a
norms distribution for a 70-item multiple choice test could be made
by administering a different sample of seven items to each of 10
examinee samples consisting of 100 subjects each as opposed to
administering the 70 items to a sample of 1,000 subjects. Estimates
of norm data were also obtained from each of the 10 examinee
samples of 100 subjects each on the total 70-item test. Comparisons
were made between the norm statistics (mean, standard deviation,
and frequency distribution) and estimates of these same
statistics derived from both the item samples and the examinee
samples. The general findings of his study were that the item
sample estimate of the mean was closer to the norms mean than
were seven of the 10 examinee sample estimates. With regard to
the variance, the item sample estimate was closer to the norm
variance than were five of the 10 examinee sample estimates. The
distribution of test scores derived from the item sample estimates
tended also to approximate the distribution of scores in the norm
group. In discussing the results of this work, Lord called attention
to the fact that there was much need to test out his conclusion on
other sets of data since he used only one item sample estimate.
He also noted the results may have been affected by the fact that
only two-thirds of the total number of items were used since
sampling was done with replacement.

¹ A paper presented at the annual meeting of National Council on
Measurements in Education, Chicago, Illinois, February 17, 1966.

Plumlee (1964) reported on the value of deriving test norms 
using the item sampling approach for the industrial personnel
situation where the number of subjects was likely to be small (i.e.,
less than 1,000) as compared to the educational situation. Her
procedure consisted of comparing the mean and standard devia- 
tion for a 30-item punctuation test taken by 200 clerical applicants 
with estimates of the same statistics obtained from administering 
three items to 10 groups of 20 subjects each and to the 30 items 
administered to 10 examinee samples of 20 each. Her results showed 
that the mean obtained by the item sampling technique was closer 
to the norm mean than were eight of the 10 examinee sample 
estimates. The estimate of the norm standard deviation from item
sampling was closer to the norm standard deviation than only one
examinee prediction. Reco; 


be estimated on samples 
item sampling estimates 


the item sampling estimate. All of th 
larger samples were closer than the j 


for power tests, 




Lord had noted in his 1962 report that a conclusion on the value
of item sampling techniques should be considered tentative until 
other data could be obtained. Plumlee's results, based upon
relatively small item and examinee samples, offer support to the value
of item sampling. As may be noted in the above studies, com- 
parisons between norm estimates from item samples and examinee 
samples have been made largely with relatively small item sample 
sizes (seven in the case of Lord and three by Plumlee), which 
represent item samples of only 10 per cent of the total items. 

The purpose of the present investigation was to extend further 
the research on item sampling procedures for estimating the test 
norms by using various size item samples and examinee samples, 
including the 10 per cent sample used by Lord and Plumlee. Spe- 
cifically, item sample estimates of norms were to be made from 
item and examinee samples of 10, 25, 33, and 50 per cent. A 
secondary purpose was to determine the validity of Lord's findings 
by correcting for a procedural step which he noted may have
affected his results, namely that of item sampling with replacement
leading to only two-thirds of the items being used. 


Procedure 


The general procedures followed in this study were those de- 
scribed by Lord (1962). The present investigators utilized an 
available achievement test in the area of College Hygiene devel- 
oped for the United States Armed Forces Institute by the Test 
Development Center of The Ohio State University. This test con- 
sisted of 115 multiple choice type items and was normed on a 
sample of 1,239 students. The test was basically untimed, with 
approximately two and one-half hours being allowed for the ad- 
ministration. An examination of the last five items shows that, 
on the average, 96 per cent of the examinees completed these items. 
Thus, time was not considered to be an influencing factor on the 
results. 

The examinee samples were constructed by taking every tenth 
person in the case of the 10 per cent sample, every fourth person 
for the case of the 25 per cent sample, and so on for the other 
proportions of 33 and 50 per cent. 

The item samples were drawn from the original pool of 115 
items to represent similar proportions of the total test length. Ten 
11-item samples were obtained by using a table of random numbers
in a manner such that the first 11 items selected were withdrawn
from the total pool. The second set of 11 items was then drawn
from the remaining 104 items, the third 11 from the remaining 
93, and so on. Similar procedures were employed to establish four 
item samples of 28 items each, three item samples of 38 items each 
and two item samples of 57 items each. 
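The two sampling schemes described above can be sketched as follows. This is an illustration only: the function names are hypothetical, a seeded pseudo-random shuffle stands in for the table of random numbers, and the systematic examinee samples here assume a complete roster of 1,239, so their sizes need not match the Ns reported in the tables.

```python
import random


def systematic_examinee_samples(n_examinees, k):
    """Split a roster into k samples; sample g takes every k-th person
    starting at position g (every tenth person when k is 10)."""
    return [list(range(g, n_examinees, k)) for g in range(k)]


def disjoint_item_samples(n_items, sample_size, n_samples, seed=0):
    """Draw item samples without replacement: each sample is taken from
    the items remaining after the earlier samples were withdrawn."""
    rng = random.Random(seed)
    pool = list(range(1, n_items + 1))
    rng.shuffle(pool)  # a shuffled pool, read off in consecutive blocks
    return [sorted(pool[s * sample_size:(s + 1) * sample_size])
            for s in range(n_samples)]


# Sample sizes and counts as stated in the text for the 115-item test.
designs = {11: 10, 28: 4, 38: 3, 57: 2}
item_samples = {size: disjoint_item_samples(115, size, count)
                for size, count in designs.items()}
examinee_tenths = systematic_examinee_samples(1239, 10)
```

Because the blocks are disjoint, no item is used twice within a design, which is the correction to sampling with replacement that the study set out to make; a few items (for example 5 of 115 in the 11-item design) are left unused.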

Means, standard deviations, item statistics and score distribu- 
tions were next obtained for each item and examinee sample. This 
was accomplished through the use of the O.S.U. Item Analysis 
Program and The Ohio State University Computer Center's I.B.M. 
7094 computer. The means and standard deviations for the item 
samples were then extrapolated and averaged to obtain for each 
size item sample one estimate of population mean and one estimate 
of the population standard deviation. Next, negative hypergeometric
curves were fitted to all examinee and item samples. The
procedure was to fit these curves to the mean, standard deviation and
sample size for each distribution, according to the procedure
outlined by Lord (1960). The investigators wish to acknowledge and
thank Frederic Lord for his assistance in computing these
distributions on Educational Testing Service computers.
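The text says only that item sample means were "extrapolated and averaged". One plausible reading, shown here as an assumption rather than as the authors' procedure, is to pool the item sample means weighted by N and scale the result by the ratio of total test length to item sample length. The means and Ns below are the 11-item rows of Table 2; the function name is hypothetical. The extrapolation of the standard deviation is not sketched because the text does not describe it.

```python
def extrapolated_mean(sample_means, sample_ns, items_per_sample, total_items):
    """N-weighted pooled mean on the short form, scaled to full test length.

    Both the N-weighting and the proportional scaling are assumptions.
    """
    pooled = (sum(m * n for m, n in zip(sample_means, sample_ns))
              / sum(sample_ns))
    return pooled * total_items / items_per_sample


# 11-item samples, Groups A through K of Table 2.
means_11 = [6.88, 6.04, 6.42, 5.79, 5.97, 6.78, 6.45, 6.18, 6.70, 5.81]
ns_11 = [124, 124, 124, 124, 124, 121, 121, 119, 121, 124]

estimate_11 = extrapolated_mean(means_11, ns_11, 11, 115)
```

Under these assumptions the estimate comes out near the 65.86 reported for the 11-item sample in Table 3, which is suggestive but does not confirm that this was the computation used.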

To compare the empirical score distributions of the examinee
samples with that of the norms population, the following measure
of distance was employed:

$$D = \frac{1}{K} \sum \frac{(\hat{f} - f)^2}{f} \tag{1}$$

where $D$ is a measure of discrepancy between two distributions,
$K$ is the number of examinee samples, $\hat{f}$ is $K$ times the
frequency of a score in the sample distribution, and $f$ is the
frequency of a score in the population norms distribution. It should be noted


enting by K, the number 
of score frequencies for the examinee sample distributions. 

ed to compare the fitted 
th the item and examinee 
- In this case, however, it was 
appropriate. The 1/K cor- 
aggedness of distributions was 




not indicated because all fitted distributions were smooth. The use 
of this formula here would have resulted in a misleading advantage 
for the smaller item samples. This limitation did not appear in Lord's
1962 study, where he did use D in this situation, since he used only
one size of item sample. The procedural solution decided upon was
to use the Chi-square as a measure of distance between the fitted 
sample and the empirical population norms distributions. 
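The two distance measures can be sketched side by side. The sketch takes D to be the Chi-square sum divided by K, with the sample frequencies first scaled up by K; this reading is an assumption, though it agrees with Table 3, where each Chi-square is approximately K times the fitted D (for example, 10 × 5.75 against 57.46). The function names, the skipping of zero norm frequencies, and the toy frequencies are hypothetical.

```python
def chi_square_distance(sample_freqs, norm_freqs, k):
    """Sum over scores of (K * f_sample - f_norm)^2 / f_norm.

    Scores with a zero norm frequency are skipped (an assumption)."""
    return sum((k * fs - fn) ** 2 / fn
               for fs, fn in zip(sample_freqs, norm_freqs) if fn > 0)


def d_index(sample_freqs, norm_freqs, k):
    """The Chi-square sum with the 1/K correction applied."""
    return chi_square_distance(sample_freqs, norm_freqs, k) / k


# Toy frequency distributions over four score values (illustrative only).
norm = [10, 20, 30, 40]
sample = [2, 2, 3, 3]  # a one-tenth sample, so K = 10

chi_sq = chi_square_distance(sample, norm, k=10)
d = d_index(sample, norm, k=10)
```

Dropping the 1/K factor, as the authors did for the fitted distributions, puts samples of different sizes on the same footing instead of favoring the smaller ones.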


Results 


Descriptive statistics showing the performance on the total test 
of 115 items for the 19 examinee samples are presented as Table 1.
Similar data for the four item sample estimates are presented in 
Table 2. An interesting observation in this table is the increases 
in reliability with increases in item and examinee sample sizes. 


TABLE 1
Descriptive Statistics for Various Examinee Sample Sizes

Examinee                        Standard
Samples           N     Mean    Deviation   KR20   KR21   S.B.*
Total Group     1239    66.94     17.71      .93    .92    .94
One-Tenth
  Group A        124    67.90     18.35      .94    .93    .94
  Group B        124    65.35     17.23      .93    .91    .94
  Group C        124    67.90     17.93      .93    .92    .94
  Group D        124    67.01     18.22      .94    .92    .94
  Group E        124    67.47     17.61      .93    .92    .93
  Group F        121    67.06     18.02      .93    .92    .93
  Group G        121    66.29     18.14      .93    .92    .95
  Group H        119    66.18     17.11      .93    .91    .92
  Group J        121    68.05     17.87      .93    .92    .94
  Group K        124    67.23     18.92      .92    .91    .92
One-Fourth
  Group A        311    67.06     17.48      .93    .92    .94
  Group B        308    65.66     17.77      .93    .92    .93
  Group C        306    67.10     18.66      .94    .93    .94
  Group D        309    67.82     17.03      .93    .91    .92
One-Third
  Group A        413    66.53     17.98      .93    .92    .94
  Group B        413    66.66     17.85      .93    .92    .94
  Group C        413    67.64     17.33      .93    .92    .93
One-Half
  Group A        621    67.13     18.20      .93    .92    .94
  Group B        618    66.75     17.22      .93    .91    .93

*Spearman-Brown correction based on odd-even split.
TABLE 2
Descriptive Statistics for Various Item-Sample Sizes

Item                            Standard
Sample Size       N     Mean    Deviation   KR20   KR21          S.B.*
11 Items
  Group A        124     6.88      2.38      .58    .55           .54
  Group B        124     6.04      2.13      .50    .37           .48
  Group C        124     6.42      2.40      .61    .54           .71
  Group D        124     5.79      2.22      .52    .43           .58
  Group E        124     5.97      1.98      .41    .26           .52
  Group F        121     6.78      2.29      .57    .89           .47
  Group G        121     6.45      2.34      .57    .51           .58
  Group H        119     6.18      1.91      .38    .20           .46
  Group J        121     6.70      2.15      .52    [illegible]   .47
  Group K        124     5.81      2.22      .49    .43           .61
28 Items
  Group A        311    15.69      4.29      .70    .22           .75
  Group B        308    14.56      4.99      .76    .43           .75
  Group C        306    18.74      5.04      .78    .78           .82
  Group D        309    15.75      4.78      .73    .98           .74
38 Items
  Group A        413    21.11      5.97      .79    .75           .80
  Group B        413    21.89      6.46      .81    .79           .84
  Group C        413    22.95      6.10      .79    .77           .80
57 Items
  Group A        621    34.59      9.35      .87    .86           .88
  Group B        618    31.81      8.70      .85    .83           .87

*Spearman-Brown correction based on odd-even split.


The principal results of the investigation are reported in Table
3. Means and variances are shown for all samples. D statistics
indicate the distance or discrepancy between the empirical
frequency distribution for each examinee sample and for the
population. Chi-square values represent the distance between the
hypergeometric distribution for each sample and the empirical frequency
distribution for the population.


Discussion 


the population mean. 
(290.02 to 357.97) als 
variance, 313.64. Kuder-Richardson 20 internal consistency
coefficients for the examinee samples ranged from .92 to .94 and all
were thus essentially equivalent to the population value of .93.
The four item samples also provided good estimates of the
population data. The item sample means ranged from 65.86 to 66.73.
While these slightly underestimated the population mean of 66.94,
all were within the standard error of measurement for the
population mean. The item sample estimates of variance ranged from
274.96 to 327.13 and thus compared favorably with the population
variance of 313.64. The examinee and item sample estimates of
means and variances are compared below.

TABLE 3
Comparisons of Estimates of Norm Distribution for Various
Examinee Sample and Item Sample Sizes

                Number                              D            D          Chi-square
Data            of Items     N     Mean   Variance  (empirical)a (fitted)b  (fitted)c
One-Tenth
  Group G         115       121   66.29   329.06     99.40        5.75       57.46
  Group E         115       124   67.47   310.11     97.14        5.64       56.41
  Group B         115       124   65.35   296.87     87.54        6.70       67.03
  Group A         115       124   67.90   336.72     87.43        6.44       64.40
  Group K         115       124   67.23   357.97     84.93        7.14       71.40
  Group H         115       119   66.18   292.75     83.76        6.10       60.99
  Group D         115       124   67.01   332.33     82.11        5.83       58.32
  Group F         115       121   67.06   324.72     71.97        5.48       54.81
  Group C         115       124   67.90   321.49     68.69        6.22       62.22
  Group J         115       121   68.05   319.34     63.46        6.33       63.28
  Item Sample      11     1,236   65.86   327.13      --          5.93       59.30
  Norm Group      115     1,239   66.94   313.64      --           --         --
One-Fourth
  Group B         115       308   65.66   315.77     70.27       14.91       59.67
  Group D         115       309   67.82   290.02     62.47       15.46       61.83
  Group A         115       311   67.06   305.55     59.13       13.71       54.82
  Group C         115       306   67.10   348.20     54.87       16.46       65.85
  Item Sample      28     1,234   66.43   323.02      --         13.64       54.55
  Norm Group      115     1,239   66.94   313.64      --           --         --
One-Third
  Group A         115       413   66.53   323.28     65.74       18.18       54.54
  Group C         115       413   67.64   300.33     59.65       18.85       56.54
  Group B         115       413   66.66   318.62     58.05       17.88       53.65
  Item Sample      38     1,239   66.39   274.96      --         23.49       70.48
  Norm Group      115     1,239   66.94   313.64      --           --         --
One-Half
  Group A         115       621   67.13   331.24     44.40       29.02       58.05
  Group B         115       618   67.85   296.53     44.40       27.39       54.77
  Item Sample      57     1,239   66.73   313.48      --         26.60       53.19
  Norm Group      115     1,239   66.94   313.64      --           --         --

a Index of distance between the empirical norms score distribution and an
empirical sample score distribution.

b Index of distance between the fitted norms score distribution and a fitted
sample score distribution.

c Chi-square between the fitted norms score distribution and a fitted sample
score distribution.

In the case of the “one-tenth” sample data, the item sample 
estimate of the mean is closer to the norm mean than only two of 
the “one-tenth” examinee sample estimates. For the “one-fourth” 
samples the item sample estimate of the mean is closer than two 
of the four examinee sample estimates. For the “one-third” sample 
data, the item sample estimate of the mean is closer than only one 
of the three examinee sample estimates. For the “one-half” sample 
data, the item sample estimate of the mean is closer to the norm 
mean than one of the two examinee sample estimates. 
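The counts in this paragraph can be checked directly against the means in Table 3. The sketch below is only a verification aid: the function name is hypothetical, and the values are transcribed from Table 3 (norm mean 66.94), with each item sample estimate paired with the examinee sample means for the same fraction.

```python
NORM_MEAN = 66.94

# (item sample estimate, examinee sample means), transcribed from Table 3.
TABLE3_MEANS = {
    "one-tenth": (65.86, [66.29, 67.47, 65.35, 67.90, 67.23,
                          66.18, 67.01, 67.06, 67.90, 68.05]),
    "one-fourth": (66.43, [65.66, 67.82, 67.06, 67.10]),
    "one-third": (66.39, [66.53, 67.64, 66.66]),
    "one-half": (66.73, [67.13, 67.85]),
}


def count_beaten(item_estimate, examinee_estimates, norm_value):
    """Number of examinee estimates farther from the norm value than
    the item sample estimate is."""
    item_error = abs(item_estimate - norm_value)
    return sum(1 for e in examinee_estimates
               if abs(e - norm_value) > item_error)


beaten = {label: count_beaten(item, exam, NORM_MEAN)
          for label, (item, exam) in TABLE3_MEANS.items()}
```

The resulting counts (two, two, one, and one) agree with the statements made in the paragraph above.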

With regard to the estimates of the norm variance, the
"one-tenth" sample data show that the item sample estimate of the
variance was closer to the norm variance than were seven of the
10 examinee sample estimates. For the "one-fourth" sample case,
the item sample estimate was closer than two out of the four
examinee sample estimates. For the "one-third" sample case, the item
sample estimate of the norm variance was more discrepant than
all three of the examinee sample estimates. For the "one-half"
sample situation, the item sample estimate of the norm variance
was closer than both of the examinee sample estimates.

The D values for the respective examinee samples indicate a
strong relationship between the sample size and the discrepancy
between the empirical sample and population distributions. The
larger the sample, the smaller the distance between distributions.
D values for the "one-tenth" examinee samples range from 63.46
to 99.40. For the "one-fourth" samples, they range from 54.87 to
70.27. For the "one-third" samples, they range from 58.05 to
65.74. Both D's for "one-half" samples are 44.40.

As shown by the column of Chi-square values, nearly all hyper-
[...] item sampling estimates were superior to
the examinee sample estimates. For the "one-half" sample, the
item sample Chi-square was lower than those for either of the
examinee samples. For the "one-fourth" samples, the item sample
Chi-square was lower than those for all four examinee samples.
For the "one-tenth" samples, the item sample Chi-square was lower
than those for six of the 10 examinee samples. The one exception
to this trend of superiority for the item samples was in the
"one-third" samples where the item sample Chi-square was higher than
all three of those for the examinee samples.

Comparing the Chi-squares for the four different item samples, 
there does not appear to be a strong relationship between the size 
of item sample and the magnitude of the Chi-square. The notable 
exception is that the Chi-square for the "one-third" item sample 
is 70.48, while the others range from 53.19 to 59.30. These latter 
values are excellent, considering that the Chi-square between the 
population distribution and its own negative hypergeometric
distribution was 53.18. This Chi-square, which had 63 degrees of
freedom and .81 probability level, indicates a very close fit
between the population distribution and its own negative
hypergeometric distribution.
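The reported probability level can be checked approximately without statistical tables. The sketch below uses the Wilson-Hilferty normal approximation to the Chi-square distribution, which is a substitute technique chosen for this illustration and not the computation used in the study; the function name is hypothetical.

```python
from math import erfc, sqrt


def chi2_upper_tail_wh(x, df):
    """Approximate P(Chi-square with df degrees of freedom exceeds x)
    using the Wilson-Hilferty cube-root normal approximation."""
    c = 2.0 / (9.0 * df)
    z = ((x / df) ** (1.0 / 3.0) - (1.0 - c)) / sqrt(c)
    return 0.5 * erfc(z / sqrt(2.0))  # upper tail of the standard normal


p_fit = chi2_upper_tail_wh(53.18, 63)
```

The approximation gives a value close to the .81 probability level quoted in the text for a Chi-square of 53.18 with 63 degrees of freedom.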


Summary 


Generally, this study supported Lord's conclusion that item
sampling is as effective as, if not superior to, examinee sampling in
test norming. Considering that examinee sampling is the more ex- 
pensive of the approaches, there would seem to be a clear advan- 
tage for the item sampling approach. 

It should be noted, however, that this study used data from a 
test administration in which all students were given all test items. 
The item and examinee samples were drawn post-mortem to data
collection, and the results reported here may lack generalizability 
to situations where sampling would be done in advance of data 
collection. Thus, there is a need for extending item sampling re- 
Search into the area of a priori sampling. 
EQUIVALENT SCORES FOR THE GRADUATE RECORD
VERBAL AND MILLER ANALOGIES TESTS


EDWARD E. CURETON AND THOMAS B. SCOTT¹
The University of Tennessee


The Graduate Record Examination (GRE) and the Miller
Analogies Test (MAT) are the two tests most widely used for 
admission to graduate study in American universities. Since they 
are issued by different publishers (Educational Testing Service 
and Psychological Corporation respectively), equivalent scores 
have not been reported previously. The Graduate Record
Examination Aptitude Test has a Verbal section and a Quantitative
section. The Miller Analogies Test is essentially all verbal in con- 
tent. It seems logical, therefore, to equate scores on the Miller 
Analogies Test to the Verbal scores on the Graduate Record Ex- 
amination (GRE-V). A recent symposium on the equating of non- 
parallel test scores covering the limitations of this process for 
general use, with papers by Angoff, Flanagan, Lennon, and Lind- 
quist, appeared in Volume I (June, 1964) of the Journal of Educa- 
tional Measurement. 

The results reported here are based on 1341 pairs of scores from 
eleven universities (Buffalo, Cornell, Florida, Indiana, Kansas 
State, Maryland, North Carolina, Pennsylvania State, Rutgers, 
Tennessee, Texas). This is a “pick-up” sample: data were not 
requested from every university which offers advanced degrees, 
nor from every university which uses either or both tests. Some 
sets of scores were for applicants; some were for admitted graduate 
students. Some were for single departments or schools; some may 
have been for entire graduate schools. Information was not
obtained on how many students took which test first, nor on the
time intervals separating the taking of the two tests (which varied,
no doubt, within one set of scores as well as between sets), nor on
which forms of the GRE were taken. But since Educational Testing
Service reports results in terms of equated standard scores, this
last item is not important. Data on the forms of the MAT taken
were incomplete, but such data as were available suggest that the
majority of the subjects took Form K, that a number took Form
H or Form J, and that a substantial minority took Form L. Forms
H and J use the same norms as Form K, but the norms for Form
L differ about two raw-score points from those of Forms J, H, and
K. No correction for form was used.


¹ The junior author collected the data. The senior author is responsible for
the analysis.

Figure 1. GRE-V and MAT: Eighteen equipercentiles for each of eleven
universities. A few points represent two, three, or in one case four
equipercentiles.


In view of the defects noted above, it seemed advisable to
estimate the limits of the resulting errors by equating the scores for
each university separately. For each of the eleven sets of data,
two frequency distributions were prepared, grouping the MAT raw
scores by fours and the GRE-V standard scores by 30's. The
frequencies of each distribution were then smoothed, using the
five-point weighted moving average formula,


Corrected $f_0 = (-3f_{-2} + 12f_{-1} + 17f_0 + 12f_1 - 3f_2) / 35$.


Over each set of five points, this formula preserves parabolic or 
cubic trend, with one or two degrees of freedom for smoothing. 
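
The smoothing step can be illustrated with a minimal sketch. The weights (-3, 12, 17, 12, -3)/35 are those of the formula above; the function name and the treatment of the two end frequencies on each side (left unsmoothed here) are assumptions, since the text does not say how the ends of each distribution were handled.

```python
def smooth_five_point(freqs):
    """Five-point weighted moving average with weights (-3, 12, 17, 12, -3)/35.

    Assumption: the first two and last two frequencies are returned
    unchanged, because the source does not describe the end treatment.
    """
    weights = (-3, 12, 17, 12, -3)
    smoothed = [float(f) for f in freqs]
    for i in range(2, len(freqs) - 2):
        window = freqs[i - 2:i + 3]
        smoothed[i] = sum(w * f for w, f in zip(weights, window)) / 35
    return smoothed


# A parabolic run of frequencies passes through unchanged, which is the
# sense in which the formula "preserves parabolic or cubic trend".
parabola = [i * i for i in range(8)]
print(smooth_five_point(parabola))
```

Because the end weights are negative, a smoothed frequency next to an isolated spike can fall below zero; how such values were treated is not stated in the text.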

Each pair of smoothed frequency distributions was equated by
the equipercentile method at percentiles 1, 2, 3, 5, 10, 15, 20, 30, 40,
50, 60, 70, 80, 90, 95, 97, 98, and 99, recording MAT scores to
the nearest unit and GRE-V scores to the nearest ten. Figure 1
shows the results. Each plotted point is an equipercentile for one
or other of the eleven sets of scores. In some cases, one point
represents two to four equipercentiles. Eleven equating curves
joining these points could have been drawn, but after connecting five
sets of equipercentiles, the results appeared more confusing than
helpful. Figure 1 shows quite clearly that if a single equating
curve is used instead of eleven curves, the equating error for any
one university will not exceed the standard error of measurement
of either test. The linear correlation between the 1341 pairs of
scores is .715, indicating that they are enough alike in functions
measured to justify equating. Using reliability data from the test
manuals, the correlation corrected for attenuation was estimated
to be about .88.


Figure 2. GRE-V and MAT: Equipercentile curve and 24 equipercentile points
for eleven universities combined (N = 1341).
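
The correction for attenuation can be sketched as follows, assuming the usual formula in which the observed correlation is divided by the square root of the product of the two reliabilities. The reliability values below are hypothetical placeholders chosen only to show the arithmetic; the figures actually taken from the test manuals are not reported in the text.

```python
import math


def correct_for_attenuation(r_xy, rel_x, rel_y):
    """Estimate the correlation between true scores on two tests.

    r_xy is the observed correlation; rel_x and rel_y are the
    reliability coefficients of the two tests.
    """
    return r_xy / math.sqrt(rel_x * rel_y)


# Observed correlation from the text (.715); the two reliabilities are
# HYPOTHETICAL values, not those of the GRE-V or MAT manuals.
estimate = correct_for_attenuation(0.715, rel_x=0.85, rel_y=0.78)
print(round(estimate, 2))
```

Any pair of reliabilities whose product is near .66 would give a corrected value close to the .88 reported above.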

The eleven frequency distributions for each test were combined, 
the two combined distributions were smoothed as before, and 


TABLE 1 


Equivalent Scores on the Miller Analogies Test and the Graduate 
Record Aptitude Test: Verbal 


MAT   GRE-V      MAT   GRE-V      MAT   GRE-V
13 300* 40 450 70 630* 
M 300 а 460 т 640* 
15 310 43 470* 5 Ee 
ш Ый 13 650 
1 E m 480* 74 660 
18 330 45 480 75 660* 
19 330* 46 490 76 670* 
a TS Я 500, 77 670 
21 340 49 510 79 690» 
1 Ит 690 

23 360 50 510 
24 360* г. 520* s Poo 
520 82 т10* 
м PN я ^m 
27 380 20 
2 390 55 540* 
2 pe A pud 85 730 
K 86 740* 
2 ia S 550 87 740 
al 400° D 560 88 750 
a nie 570 89 760 
410 
34 420 a ia 90 770 
ү ин 91 780 
35. 430 63 500 92 790 
36 430* 64 600 93 800 
т а” 94 810 
39 450° e eid 95 820 
67 610 
68 620 


equipercentile scores were computed at percentiles 0.1, 0.5, 1, 2, 3,
6, 10, 15, 20, 25, 30, 40, 50, 60, 70, 80, 85, 90, 94, 97, 98, 99, 99.5, 
and 99.9. These scores were recorded to 1/10 of a score point for 
the MAT, and to one standard-score point for the GRE-V. The 
pairs of points were plotted on 17" by 22" quarter-inch cross- 
section paper, with each quarter-inch representing one score point 
on the MAT and ten standard-score points on the GRE-V. Figure 
2 is a reduced re-plot of these points. A curve was fitted by eye 
to the points (as in Figure 2), and the equivalent scores were read 
from the plot. 
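
The equipercentile computation described above can be sketched as below. This is an illustration under stated assumptions, not the authors' procedure: intervals are taken to be of equal width, cases are taken to be spread evenly within each interval, and the function names and the small distributions in the example are hypothetical.

```python
def score_at_percentile(midpoints, freqs, pct):
    """Score below which pct per cent of the cases fall.

    Assumes equal-width intervals given by their midpoints and linear
    interpolation within the interval that contains the percentile.
    """
    width = midpoints[1] - midpoints[0]
    target = sum(freqs) * pct / 100.0
    cumulative = 0.0
    for mid, f in zip(midpoints, freqs):
        if f > 0 and cumulative + f >= target:
            lower_limit = mid - width / 2.0
            return lower_limit + width * (target - cumulative) / f
        cumulative += f
    return midpoints[-1] + width / 2.0


def equipercentile_pairs(x_mid, x_freq, y_mid, y_freq, percentiles):
    """Pairs of scores on two tests that stand at the same percentiles."""
    return [
        (score_at_percentile(x_mid, x_freq, p),
         score_at_percentile(y_mid, y_freq, p))
        for p in percentiles
    ]


# Hypothetical toy distributions, for illustration only.
pairs = equipercentile_pairs([10, 20, 30], [1, 2, 1],
                             [100, 200, 300], [1, 2, 1],
                             [25, 50])
print(pairs)
```

Each pair gives one plotted point of the kind shown in Figure 2; the curve itself was fitted by eye in the study.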

The final results are shown in the accompanying table. The table 
shows the GRE-V standard score, rounded to the nearest ten, that 
is equivalent to each unit raw score on the MAT, over the range 
covered by our sample. Wherever the same GRE-V standard score 
is the nearest equivalent to two MAT raw scores, one of the two 
GRE-V standard scores is followed by an asterisk, indicating that 
this is the score to be used in locating the MAT equivalent. Thus,
300 is the GRE-V standard score most nearly equivalent to both
MAT raw scores 13 and 14, but the asterisk indicates that MAT
raw score 13 is closer than MAT raw score 14 to the GRE-V
standard score 300. 

In the region below MAT raw score 90 and GRE-V standard 
score 770, the relation is almost linear. For this region the two 
simple linear equations, 


GRE-V = 216 + 6 MAT,
MAT = (GRE-V)/6 - 36,


yield results which, when rounded to the nearest MAT unit or the
nearest GRE-V ten, do not differ by more than one MAT unit
or ten GRE-V units from those given by the table.
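
The two linear equations can be written out as a short sketch. The function names and the rounding helper are assumptions made for illustration; the equations themselves are those given above and apply only to the region below MAT raw score 90 and GRE-V standard score 770.

```python
import math


def mat_to_gre_v(mat):
    """Linear approximation GRE-V = 216 + 6 MAT (MAT below 90 only)."""
    return 216 + 6 * mat


def gre_v_to_mat(gre_v):
    """Inverse approximation MAT = (GRE-V)/6 - 36 (GRE-V below 770 only)."""
    return gre_v / 6 - 36


def round_to_nearest_ten(value):
    """Round to the nearest ten, as GRE-V equivalents are reported."""
    return int(math.floor(value / 10 + 0.5)) * 10


# MAT raw score 40: the equation gives 456, which rounds to 460.
print(round_to_nearest_ten(mat_to_gre_v(40)))
```

For MAT 40 the table lists 450, so the linear value of 460 falls within the stated tolerance of ten GRE-V units.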
DIFFERENTIATION OF SOLDIER REACTIONS TO
SEVERE ENVIRONMENTAL STRESS BY MMPI AND
OTHER VARIABLES¹


JAMES K. ARIMA
Litton Scientific Support Laboratory, Ft. Ord, California 


This study investigates the ability of 30 MMPI and five
demographic variables to differentiate the reactions of 777 U. S. Army
soldiers to a physically strenuous environmental situation. The op- 
portunity for conducting the study arose when the U. S. Army 
desired to learn what psychological effects might result from pro- 
longed wearing in the field of extremely efficient protective clothing 
that would encapsulate an individual but which would not provide 
an artificial internal environment, as found in clothing developed 
for aircraft pilots, astronauts, and divers. 

Preliminary field studies showed that prolonged wearing of the 
clothing resulted in many individuals falling out of field exercises 
for a variety of physiological complaints, most of which appeared 
to be directly related to increased body temperature and water 
deprivation. These complaints, in turn, seemed to result because 
the fully encapsulating mode of the clothing, and to a less extent 
a partially encapsulating mode, severely restricted dissipation of 
body heat, precluded intake of food and water, and prevented 
elimination of body wastes. This being the case, the question was
rephrased for this portion of the overall study to ask what iden- 
tifiable individual personality characteristics might be associated 
with the reactions seen in the field to the protective clothing.
Evaluation of the effects of the clothing on psychomotor and cognitive
abilities was also conducted, and the results are being reported
elsewhere (Arima, 1965).


¹ This research was carried out at the US Army Combat Developments
Command Experimentation Command, Ft. Ord, California, while the author
was a member of that organization. Reproduction in whole or in part is
permitted for any purpose of the United States Government.


Method 


Subjects 


The Ss were experienced soldiers of the 194th Armored Brigade,
which typically provides Ss for experimentation at the U. S. Army
Combat Developments Command Experimentation Command, Ft.
Ord, California. Their distribution according to age, General
Technical (GT) scores of the Army Classification Battery (ACB), and
education is shown in Tables 1 and 2.


TABLE 1
Categories of Demographic Variables

                                              Categories
Variables           1              2               3                 4               5
Age (years)         Under 18       18-19           20-24             25-34           35 or older
GT Scores           Under 90       90-109          110-129           130 or more     xxx
Education (years)   Less than 8    8-11            12                13-16           More than 16
Rank                Other than     NCO (E-5 or     Officer or        xxx             xxx
                    NCO or         higher)         Warrant
                    officer                        Officer
Length of Service   Less than      1 yr to         2 yrs to          3 yrs to        10 yrs
                    1 yr           1 yr 11 mos     2 yrs 11 mos      9 yrs 11 mos    or more


The GT score is an equally weighted composite score of the Verbal
and Arithmetic Reasoning subtests of the ACB with a mean of
100 and a SD of 20. It provides an estimate of intelligence scores
based on similar test materials.


Procedure

The MMPI was administered to groups of approximately 100
in unit messhalls prior to their participation in field exercises
designed to evaluate the effects of the protective clothing. The
field exercises, of varying severity and length (from about 4 to
72 hrs.), required the performance of typical combat tasks for
TABLE 2
Distribution of Subjects According to Demographic Variables

                                            Category†
Characteristic          Totals*       1       2       3       4       5

Age             No.       777         4      76     464     177      56
                %       100.0       0.5     9.8    59.7    22.8     7.2

GT Score        No.       668       214     298     136      20      --
                %       100.0      32.0    44.6    20.4     3.0      --

Education       No.       758        12     253     410      80       3
                %       100.0       1.6    33.4    54.1    10.5     0.4

Rank            No.       777       565     192      20      --      --
                %       100.0      72.7    24.7     2.6      --      --

Length of       No.       774       135     172     171     177     119
Service         %       100.0      17.4    22.2    22.1    22.9    15.4


*Owing to deficiencies in the data, the total 777 cases could not be used for all characteristics.
†Category Code: See Table 1.


ground troops. Shorter exercises were run until they were com- 
pleted, but the longer exercises were terminated for individual 
units and elements when 50 per cent of the participants were 
forced to drop out. In some instances, dropouts exceeded 50 per 
cent owing to lag in the system for reporting and evaluating their 
relative frequency. Dropouts were first seen by medical monitors 
in the maneuver area who made out an emergency medical tag 
which gave the symptomatology necessitating removal of the in- 
dividual from the exercise. When the dropout had been evacuated 
to the aid station established for the exercises, he was seen by a 
medical officer who made and recorded the final determination of 
the reason for dropping out. The individual was then rehabilitated. 

MMPI Variables. Individual MMPI protocols were scored by 
National Computer Systems, Minneapolis, Minnesota. T-scores 
with the K-correction were obtained for the three validity scales, 
the 10 basic scales, and the 11 special scales (Dahlstrom and 
Welsh, 1960). Raw scores were used for the “?” scale. In addition, 
Band, Delta, Anxiety Index, Dissimulation Index, and Internaliza- 
tion Ratio were computed for each individual. 

Invalid profiles were defined as those meeting at least one of the
following rules: (a) equal to or greater than 70 on validity scale
L; (b) equal to or greater than 80 on validity scale F and equal to
or less than +39 on the dissimulation index; and (c) equal to or
greater than 80 on validity scale F and equal to or greater than
+40 on the dissimulation index. The first rule was used to identify
the “minus getting” set and random-response set; the second to
identify the random-response set (along with the first rule); and
the third rule was used to identify the “plus getting” set.
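
The three rules can be expressed as a small sketch. The function and argument names are assumptions for illustration; the cutoffs are those stated above, and the dissimulation index is taken as an already-computed whole number, since its computation is not described here.

```python
def invalid_profile_rules(l_score, f_score, dissimulation_index):
    """Return the letters of the invalidity rules that a profile meets.

    l_score and f_score are T-scores on validity scales L and F;
    dissimulation_index is assumed to be a whole number.
    """
    rules = []
    if l_score >= 70:
        rules.append("a")  # "minus getting" or random-response set
    if f_score >= 80 and dissimulation_index <= 39:
        rules.append("b")  # random-response set
    if f_score >= 80 and dissimulation_index >= 40:
        rules.append("c")  # "plus getting" set
    return rules


def is_invalid(l_score, f_score, dissimulation_index):
    """A profile is invalid if it meets at least one rule."""
    return bool(invalid_profile_rules(l_score, f_score, dissimulation_index))


# A profile with L = 75, F = 85, and an index of 10 meets rules a and b.
print(invalid_profile_rules(75, 85, 10))
```

Taken together, rules (b) and (c) flag every profile with F at or above 80; the index only decides which response set is inferred.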

Criterion Coding. All individuals were first classified into drop- 
outs and those who did not drop out (hereafter called nondrops). 
In addition, the reasons for dropping out were categorized into as 
few mutually exclusive and exhaustive categories as possible. For 
reasons which will become clear below, only those categories which 
had greater than 10 dropouts among valid MMPI protocols were 
used. These categories, with their alphanumeric codes, are as fol- 
lows: 


A00=Rectal temperatures equal to or greater than 103°F.

B00=Fatigue, exhaustion, weakness

C00=Stomach cramps, abdominal cramps, heat cramps (abdominal),
dry heaves

D00=A combination of B00 and C00

E00=The diagnosis “water deprivation”

F01=Headache, dizziness, sinusitis

F02=Stomachache, nausea, vomiting, upset stomach, gastritis,
indigestion

F03=Upper respiratory infection (URI), pleurisy, tonsillitis,
chills, exposure, pharyngitis, bronchitis

F04=Musculo-skeletal or traumatic symptoms to include pain
in side, cramps in extremities, sprains, bruises, back trouble,
lacerations

F05=... such as hemorrhoids, diarrhea, bleeding


Results 
MMPI Characteristics of the Sample


The MMPI characteristics of the sample of 777 soldiers are shown
in Table 3, by total (777), valid (667), and invalid (110) protocols.
Figure 1 graphically portrays the profile of mean grouped scores
by total, valid, and invalid individual protocols for the validity and
basic scales. Figure 2 shows these profiles for the special scales.
TABLE 3


Mean (M) and Standard Deviation (SD) of Scores on the MMPI Validity, Basic,
and Special Scales and the Special Indexes According to Total Sample
(N = 777), Valid Profiles (N = 667), and Invalid Profiles (N = 110)


                        Total             Valid            Invalid
Scale or Index         M       SD        M       SD        M       SD
Validity
  ?                  3.24     7.40     3.24     7.57     3.22     6.33
  L                 51.69     8.24    50.95     7.13    56.20    12.18
  F                 61.25    12.93    57.59     8.05    83.48    14.59
  K                 53.18     9.39    53.88     8.88    48.90    11.13
Basic
  Hs                59.59    13.07    57.04    11.57    71.40    15.28
  D                 61.10    12.89    59.23    11.74    72.42    13.80
  Hy                59.66    10.21    58.31     9.14    67.81    12.44
  Pd                64.52    11.28    62.97    10.33    73.88    12.29
  Mf                54.04     8.99    53.11     8.60    59.68     9.27
  Pa                55.58    11.94    53.13     9.32    70.40    15.04
  Pt                59.41    13.33    57.20    11.67    72.80    14.91
  Sc                62.24    15.29    58.70    11.97    83.73    15.71
  Ma                62.54    11.07    61.10    10.71    71.27    13.36
  Si                53.00     9.31    52.01     8.93    59.00     9.38
Special
  A                 51.18    11.70    49.58    10.52    60.85    13.70
  R                 51.35    10.59    51.22    10.02    52.15    13.58
  Es                51.06    11.10    52.79     9.99    40.59    11.79
  Lb                55.10    11.44    54.43    11.21    59.19    12.04
  Ca                56.37    11.20    54.74    10.03    66.27    12.73
  Dy                50.11    10.65    48.81     9.82    57.98    12.09
  Do                48.87     9.80    49.52     9.62    44.93     9.98
  Re                44.39    10.65    45.25     9.98    39.19    12.92
  Pr                54.67    10.39    53.39     9.51    62.41    12.09
  St                55.05     7.79    55.77     7.54    50.71     7.90
  Cn                53.85    11.43    53.31    10.97    57.14    13.50
Indexes
  Dissimulation      8.07    18.97     3.70    14.11    34.57    22.70
  Anxiety           61.38    21.13    58.88    19.93    76.55    21.94
  Internalization    1.05       15     1.06       15      .99       12
  Band                .95    20.03     -.97       74    12.70    25.01
  Delta               .84    20.72       14      .76     5.06    25.15


The relative frequencies of all possible permutations of validity
scale patterns (excluding the “?” scale) among valid and invalid
individual protocols are shown in Table 4. The salient characteristic
which distinguishes the invalid profile is the dissimulation index
resulting from a high F without a similar elevation on the K scale.
Figure 1. Mean t scores on the MMPI validity and basic scales
(all cases, valid profiles, and invalid profiles).


Figure 2. Mean t scores on the MMPI special scales
(all cases, valid profiles, and invalid profiles).

Differences occur between valid and invalid grouped mean scores
on most of the basic and special scales and special indexes except
for the “?” and R scales. The differences are in the expected
direction.


Demographic Factors 


Table 5 shows the relative frequency of dropouts and the relative 


TABLE 4 


Differences in Relative Frequency of Validity Scale Patterns for Valid 
and Invalid Profiles (excluding scale ?) 


Relative Frequency (Per cent) 


Scale Pattern Valid Invalid 
LFK 6.5 10.9 
LKF 6.5 9.0 
FLK 27.7 70.0 
FKL 24.6 9.1 
KLF 18.1 0.9 
KFL 16.6 0.0 

TABLE 5 


Relative Frequency (Per Cent) of Dropouts and Invalid MMPI 
Profiles Within Demographic Categories 


                                     Categories*
Characteristic                  1       2       3       4       5
Age            Dropouts      50.0    40.6    36.6    27.1    17.4
               Invalids      00.0    25.0    14.0    10.2    14.3
GT Score       Dropouts      38.6    32.9    28.5    19.4      --
               Invalids      19.2    13.4    12.5    00.0      --
Education      Dropouts      43.3    37.5    30.9    36.0    00.0
               Invalids       8.3    18.2    12.9     8.8    00.0
Rank           Dropouts      37.5    23.3    13.3      --      --
               Invalids      15.9     9.9     5.0      --      --
Length of      Dropouts      46.5    27.1    40.9    30.6    17.6
Service        Invalids      12.6    12.2    18.1    13.6    14.3


*See Table 1 for category code.


frequency of invalid MMPI profiles within each demographic
category. The general trend is an inverse relationship between the
relative frequency of dropouts and invalids, on one hand, and the
demographic variables on the other. This relationship is more
consistent for the relative frequency of dropouts.
MMPI Variables


For each scale and special index, invalid profiles were compared
against valid profiles for dropouts and nondrops separately, and
dropouts were compared against nondrops for valid and invalid
MMPI profiles separately. The confidence to be placed in the
observed differences was evaluated by t-tests in each case,² and the
results are presented in Table 6. The results show that the
differences between invalid and valid MMPI


TABLE 6 


Scales and Special Indexes of the MMPI on Which Significant
Differences Occur for Valid and Invalid Profiles and
Dropout and Nondrop Groupings

Nondrop Dropout Valid Invalid 
Invalid vs. Invalid vs. Dropout vs. ^ Dropout vs. 
Scale or Index Valid Valid Nondrop Nondrop 
КО ы шше олоор. Nondrop 
Validity 
i юв, n.s. n.s. ce 
** ** 
Е ** ++ "sg > i 
K Wed 3e 2.8. пз. 
Basic 
Hs БЫ LI 
р „э Pry E x: 
y ss se i 
Pd +. ** M ai 
Mt + * = 
Pa as 12 пз. n.s. 
Pt . + bey um 
Вс КЫ з. m и 
Ма ** as e 
Si ee by, ae n.8. 
Special 5 e 
A ae $e es 
R ns. ns. n RA 
Es RES E a .8. n.8. 
Lb Prd r Й n.s, 
Ca es i n.8. n.8. 
Dy ss bis Y n.s. 
Do — ee T юв. 
Ве uS Des T. ns, 
Pr з эө ns. 0.8. 
St uS EA oe пз. 
Cn es р 6 n.s. 
n.s. n.s. 


² The t tests were all programmed and run on computers by Stanford
Research Institute, Menlo Park, California.
TABLE 6 (Continued)


Nondrop Dropout Valid Invalid 

Invalid vs. Invalid vs. Dropout vs. Dropout vs. 
Scale or Index Valid Valid Nondrop Nondrop 

Indexes 

Dissimulation ых bad bd n.s. 
Anxiety ve s. ae n.8. 
Internalization 79% лек ne n.8. 
Band ә y ns. ns. 
Delta 0.8. ns. n.s. ns. 


*Significant difference at .05 level. 

**Significant difference at .01 level. 

-Indicates second member of pair has higher mean. 
profiles continue to exist for both dropouts and nondrops and that, 
except for the “?” and F scales, the MMPI does not differentiate 
dropouts and nondrops with invalid test protocols. The results 
show that the MMPI does differentiate between dropouts and non- 
drops with valid test protocols on more than half (16) of the 30 
scales and special indexes used in this study. Moreover, most of 
the differences have a high degree of statistical significance. 
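
A group comparison of this kind can be illustrated from summary statistics alone. The sketch below assumes a pooled-variance (Student) t test for two independent groups; the text does not say which form of the t test was programmed, so this is only one plausible reading. The example uses the F-scale means, standard deviations, and group sizes for all valid and all invalid profiles from Table 3, which is not one of the four within-group comparisons of Table 6.

```python
import math


def pooled_t(mean1, sd1, n1, mean2, sd2, n2):
    """Student's t and degrees of freedom from two groups' summaries.

    Assumes independent groups and a common population variance.
    """
    df = n1 + n2 - 2
    pooled_variance = ((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2) / df
    standard_error = math.sqrt(pooled_variance * (1.0 / n1 + 1.0 / n2))
    return (mean1 - mean2) / standard_error, df


# F scale, Table 3: invalid profiles (N = 110) against valid (N = 667).
t_value, df = pooled_t(83.48, 14.59, 110, 57.59, 8.05, 667)
print(round(t_value, 1), df)
```

With standard deviations as unequal as these, a test that does not assume a common variance might be preferred; the pooled form is shown only because it is the simplest.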

These findings using valid MMPI protocols indicated it was 
feasible to investigate what scales and indexes of the MMPI would 
differentiate individuals grouped for specific reasons for dropping 
out from the general nondrop group. The analysis was carried out 
using only valid MMPI records and categories for dropping out 
with an N of at least 10, for reliability considerations. The results 
are shown in Table 7. 


TABLE 7 


Significant Mean Differences between Categories of Valid 
Dropouts and Valid Nondrops 


Dropout Category
Scale or
Index A00 B00 C00 D00 E00 F01 F02 F03 F04 F05

Validity 

? * 

L 

F oe ** 

K 

Basic 

Hs * oe ** 
р * * ** 
Hy * ** * 


Pd se 2 ** oe 
TABLE 7 (Continued)


Dropout Category

Scale or
Index A00 B00 C00 D00 E00 F01 F02 F03 F04 F05

Uo ns ір 


m 
e 
+ 


++ * ** Д LJ ** 


|=) 

о 

: 

П 

* 

n 
tae 
n 
l 


** * 
Anxiety Ы .. . 


zation - 


Delta Y 


*Significant difference at .05 level.
**Significant difference at .01 level.
-Indicates nondrops have higher mean.
Each scale or index that showe 


ences for all dropouts also 
Discussion


Generally, the profile of mean scores on the MMPI scales is
quite high, there is a large number of invalids, and the invalids
differ from the valids in a consistently abnormal direction. The
resulting group profile is considerably different from a typical
profile of grouped scores for college students, who form the
background for much of the research on the MMPI among normal
populations (Kleinmuntz, 1963), and is typically the picture of a
more immature, young group having a lower level of education
and socialization and coming from lower socioeconomic
backgrounds.

Among the scales which differentiate the valid dropouts from 
valid nondrops, collectively and by reasons for dropping out, there 
is a strong representation of scales that are often associated with 
worrying, anxiety, ruminative thinking, and a lack or loss of zest. 
This is suggested first and foremost by the broad relevance of the 
Pt scale over many dropout categories and is supported addition- 
ally by the frequency with which the Sc, A, and Ca scales also 
differentiate valid dropout groups. Moreover, significant differences 
on the D scale and Anxiety Index only occur where differences
also occur on Pt. Since all MMPI testing was done before
participation in the field experiment, it would appear that there was
considerable apprehension among the troops over what might
happen to them physically and over how they would compare with
their fellow soldiers in the ability to hold out under stress. It
appears, in retrospect, that those who were threatened most
beforehand were more likely to drop out once the experiment actually
began. This explanation is consistent with the series of studies
reported by Eriksen and others which tended to support the thesis
that those high on the Pt scale are highly anxious, introverted,
and pessimistic and are concerned and may be adversely affected
by fear of failure (Eriksen, 1954a, 1954b; Eriksen and Browne,
1956; Eriksen and Davids, 1955).

It is also possible that actual neurotic tendencies, as differen- 
tiated from personality differences, are being demonstrated in the 
higher scores occurring on the above-discussed scales among the 
dropouts. The influence of neuroticism is also suggested by the 
appearance of the Hs and Hy scales in combination to
differentiate dropouts in the headache (F01), stomachache (F02), and
the muscular aches and pains (F04) categories. While Eriksen and 
others (Eysenck, 1957) would consider those scoring high on the 
Pt and Hy scales at diametrically opposite poles on such per- 
sonality dimensions as manifest anxiety, introversion-extraversion, 
and optimism-pessimism, Eysenck (1957) has also shown that 
neuroticism constitutes a separate dimension which can be common 
to such divergent personality groupings. 

Those categories which had a less subjective diagnosis for
determining causes for dropping out, i.e., C00 (stomach and abdominal
cramps) and E00 (water deprivation, a time-dependent diagnosis),
show generally less marked departures from the nondrops
on the MMPI variables. Category A00, however, with the most
specific dropout diagnosis (rectal temperature of 103° or more),³
shows dramatic differences from the nondrops. Their higher scores
on the F, Pd, and Pr scales and low scores on Es and Do suggest
that group A00 differed from the nondrops in the direction of
inadequate personality development and characterological weakness.
This pattern is also evident in the F02 dropout group
(stomachache, etc.). In the case of the A00 group, it is difficult to explain
how these characteristics could have resulted in higher rectal
temperatures. It may be that these dropouts used poorer judgment
in pacing themselves, tended to be burdened with greater loads,
or maliciously exerted high effort for a short period to raise their
body temperatures (disregarding the dangers involved) to
eliminate themselves from the taxing field effort. Or it may be that
lability of the physiologic system is associated with immature
character development in adults.

The frequency with which low scores on the Do scale discriminate
categories of dropouts is as impressive as the role played by
the Pt scale among the basic (clinical) scales. Along with higher
scores on the Es scale, these results seem to indicate that the
nondrops had greater strength of character and self-confidence,
[...] supported by Knapp (1960), who found significant differences on the
Dominance scale between Marine Corps officer pilots and Marine
Corps enlisted men. Knapp, however, also found similar differences
on the Social Responsibility scale, whereas the Re scale was
nondiscriminating in this study for valid profiles. Probably, the social
responsibility scale reflects a high measure of intellectual
responsibility and cultural maturity and, in a group that is relatively
homogeneous in these respects, no differences could be expected among
subsamples of the group.


³ Preselected individuals wore indwelling thermistor [...] catheter [...]
running out over the belt at the back. These were [...] medical
monitors with immediate temperature-readout devices.

The Lb scale, which was created among other things to predict 
malingering within military populations, was absolutely nondis- 
criminating in this study. The Es scale, as a general measure of 
psychological health and constructive forces within the personality, 
might have been expected to differentiate more of the categories of 
dropouts. It is believed that the Lb scale, and to a less extent the 
Es scale, were not efficient discriminators in this study because 
their validation has usually involved sick populations already under
treatment, and responses on the scales reflect to a considerable
degree the patient's attitude toward his illness and the treatment 
he is undergoing, regardless of the basis of his illness. Thus, Him- 
melstein (1964) has shown that there is a difference between clini- 
cal and nonclinical college populations on the Es scale, and it 
might be reasonable to assume that reliable variability on per- 
sonality dimensions found in patient populations may not exist to 
the same degree in a nonpatient population. 


Summary 


The MMPI was administered to 777 U. S. Army soldiers before 
they participated in field trials testing protective clothing which 
placed a severe physiological stress on the individual. In some 
instances, over half the soldiers were not able to complete field 
trials in which they were participants. This created a binary 
dropout-nondrop criterion measure. In addition, several mutually 
exclusive categories were created to classify reasons given for drop- 
ping out. Reliable differences in distributions on 16 of 30 MMPI 
variables—consisting mostly of the validity, basic, and special 
scales—were found for the binary criterion. Nineteen of the 30 
MMPI variables showed reliable differences for at least one of 
the subcategories of dropouts when compared with the general
nondrop group. These differences occurred on scales which seemed to
measure anxiety and neuroticism and strength or weakness of char- 
acter. Differences were especially prevalent on the Pt and Do 
scales. It was suggested that expected differences were not found 
on some scales because there is insufficient consistent variability 
on them in a homogenous, normal population. Over 14 per cent of 
the obtained MMPI protocols were considered invalid because of 
an excessively high dissimulation index. While these were not used 
in the foregoing analysis, there were indications that giving an 
invalid record was positively associated with dropping out and
negatively associated with age, education, intelligence, and mili- 
tary rank. These demographic variables were also inversely related 
to the relative frequency of dropouts. 


REFERENCES 


C. T. An Experimental and Theoretical 
Vg Journal of Abnormal and Social 


А -Psyehasthenia 
chology,1955,50,135-137. ^^^ of Abnormal and Social Psy- 


Evidence on the Ego St; 
Measure of Psychological Health, Journ еа 


EDUCATIONAL AND PSYCHOLOGICAL MEASUREMENT 
1967, 27, 631-642. 


A CANONICAL CORRELATIONAL ANALYSIS OF THE 
STRONG VOCATIONAL INTEREST BLANK AND 
THE MINNESOTA MULTIPHASIC PERSONALITY 

INVENTORY FOR A FEMALE COLLEGE 
POPULATION¹


GEORGE H. DUNTEMAN 
Educational Testing Service 
AND 
JOHN P. BAILEY, JR. 


College of Health Related Professions 
University of Florida 


Совете (1964) has reviewed some relationships between both 
the Strong Vocational Interest Blank (SVIB) and the Kuder Pref- 
erence Record and various personality scales. He points out that 
most authorities agree that personality and interest characteristics 
are significantly correlated but not high enough to substitute in- 
terest appraisals for personality appraisals. Since most studies have 
involved the consideration of individual correlations between se- 
lected personality and interest scales, it is hard to grasp the overall 
relationship between a personality domain and an interest domain, 
i.e., the relationship between all scales of both a personality and an 
interest test. Selected correlations out of an intercorrelation matrix 
involving interest and personality variables disclose little or
nothing about the overall relationships between the two sets of
measures.

The purpose of the present investigation was to determine
through the use of canonical correlational analysis the extent to
which the Minnesota Multiphasic Personality Inventory (MMPI)
and the SVIB for Women were related and how many common
factors there were underlying the relationships between these two
sets of variables.


¹ This research was supported in part by a research grant RD-1127 from the
Vocational Rehabilitation Administration, Department of Health, Education,
and Welfare, Washington, D. C. Part of the data for this study was collected
under NIMH Project Grant, 380, the Public Mental Health Methods in a
University.

Cottle (1950) factor analyzed an intercorrelation matrix of 11 
basic MMPI scales, four scales of the Bell Adjustment Inventory, 
six Strong group scores, three Strong nonoccupational scales, and 
ten various scores on the Kuder Preference Record. The sample 
consisted of 400 adult males tested shortly after World War II. 
Seven common factors emerged after rotation. Two of these factors 
were common to the two personality inventories and five of the 
factors were common to the two interest inventories. Except for a 
moderate loading of the Bell Social Key on one of the interest 
factors, there was no overlap between the personality and interest 
inventories. 

In a more recent investigation, Anderson and Anker (1964) 
factor analyzed an intercorrelation matrix composed of 12 basic 
MMPI scales, 11 occupational group scores of the SVIB, and three 
nonoccupational scales of the SVIB. The sample consisted of 107
psychiatric patients. The factor analysis yielded six rotated ob- 
lique factors. Two of the factors were defined by MMPI variables 
and four were defined by SVIB variables. The MMPI factors were 
similar to those found in other studies. It was interesting to note 
that several of the factors were significantly intercorrelated,
including significant correlations among personality and interest
factors. An oblique solution would seem to be more appropriate than
an orthogonal solution since the interest lies in the possible
interrelationships among the interest and personality factors as they
empirically exist. However, these correlations were rather low and
consequently were of a more theoretical than practical interest.


The results of these two studies tended to indicate that there is
little or no overlap between the MMPI and SVIB. However,
canonical correlational analysis is a more appropriate way in which
to test hypotheses involving the possible overlap or intercorrelation
between two sets of variables. In factor analyzing the MMPI [...]
inter-test correlations, one would expect to have two independent
sets of factors, personality and interest. Only when some of the 
inter-test correlations are higher than the intra-test correlations 
would one expect to find heterogeneous factors. For all practical 
purposes, similar factors would have been obtained if each set of 
test scores had been factor analyzed separately. 

Canonical correlation involves finding the linear combination of 
one set of variables (e.g., MMPI) and the linear combination of a
second set of variables (e.g., SVIB) that will result in a maximum 
correlation between the two linear functions. It is similar to multi- 
ple correlation except that multiple correlation involves a maxi- 
mum correlation between one variable and a linear function of 
another set of variables. The number of canonical correlations is 
equal to the rank of the intercorrelation matrix among the two 
sets of variables. For example, the rank of this matrix for the 
present investigation is ten, the number of the MMPI scales. 
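
The computation described above can be sketched as follows. This is a minimal illustration and not the authors' program: the function name, the use of NumPy, and the synthetic scores are all assumptions. It relies on the standard result that the squared canonical correlations are the latent roots of a product of the partitioned covariance (or correlation) matrices, so that with 10 MMPI scales and 29 SVIB scales there can be at most ten of them.

```python
import numpy as np


def canonical_correlations(X, Y):
    """Canonical correlations between score matrices X (n x p) and Y (n x q)."""
    n = X.shape[0]
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean(axis=0)
    S11 = Xc.T @ Xc / (n - 1)   # covariances within the first set
    S22 = Yc.T @ Yc / (n - 1)   # covariances within the second set
    S12 = Xc.T @ Yc / (n - 1)   # covariances between the two sets
    # Latent roots of S11^-1 S12 S22^-1 S21 are the squared canonical correlations.
    M = np.linalg.solve(S11, S12) @ np.linalg.solve(S22, S12.T)
    roots = np.sort(np.clip(np.linalg.eigvals(M).real, 0.0, 1.0))[::-1]
    k = min(X.shape[1], Y.shape[1])  # number of canonical correlations
    return np.sqrt(roots[:k])


# Synthetic scores for illustration only (hypothetical, not the study's data).
rng = np.random.default_rng(0)
X = rng.normal(size=(182, 3))
Y = np.column_stack([X[:, 0] + rng.normal(size=182), rng.normal(size=182)])
r = canonical_correlations(X, Y)
print(len(r))  # one canonical correlation per variable in the smaller set
```

The same roots are obtained whether the product is formed on the side of the smaller or the larger set; working on the smaller set simply keeps the matrix small.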

It should be noted that in factor analysis we are interested in 
the rank of the total intercorrelation matrix which in the present 
instance is equal to 39, the number of SVIB scales used, plus the 
number of MMPI scales. Factor analysis, then, is concerned with 
locating reference vectors in the total test space of both sets of 
variables while canonical correlational analysis involves locating 
separate vectors in the test space of each set of variables which 
will have a maximum correlation between them. While individual 
variates themselves might not covary substantially across two sets 
of test variables and consequently not have significant projections 
on the same factors, transformed variables (i.e., linear functions
of the original test variables) such as those obtained in canonical 
correlation may result in high intercorrelations between the two 
sets of variables. 

Bartlett (1948) distinguishes between factor analysis and ca- 
nonical correlational analysis by referring to the factor analysis 
of a single set of variables as internal factor analysis and referring 
to canonical correlational analysis as external factor analysis. By 
external factor analysis, he means that the canonical transforma- 
tions of one group of variables into factors is made with reference 
to a second set of variables. In the present situation, the factors of 
the MMPI were extracted with reference to the SVIB. If the SVIB 
and MMPI had been factored separately, then two sets of factors 
would result which could conceivably be independent from one
another. This is because an internal criterion for factor extraction 
was utilized for each of the two sets of variables. This is basically 
the reason for preferring canonical correlational analysis over fac- 
tor analysis in looking at the interdependence between two sets of 
variables. Horst (1961) has generalized the canonical correlational 
analysis procedure for use with more than two sets of variables. 


Method 


Ten clinical scales of the MMPI and 29 scales of the SVIB for 
Women were subjected to a canonical correlational analysis. The 
specific scales of both tests are shown in Table 1. The sample was 
composed of 182 predominantly sophomore female college students 
who were enrolled in an introductory course entitled “Introduction 
to the Health Related Professions” given in the College of Health 
Related Professions at the University of Florida. 

These data were collected as part of a continuing research pro- 
gram concerned partially with determining personality, aptitude, 
interest and other differences and similarities among students
enrolled in various health and health related professions (e.g.,
Occupational Therapy, Physical Therapy, and Nursing). The
characteristics of the present sample as well as other research findings
concerning the MMPI and SVIB are described in detail by Dunte- 
man, Anderson, and Barry (1966). 

The intercorrelations were computed between all 39 variables
and the total matrix was partitioned and entered into a determinantal
equation whose solution gave the canonical correlations
between the MMPI and SVIB and their associated canonical vectors.
The size of the canonical correlations indicates the magnitude
of relationships between common factors underlying both the
MMPI and SVIB. The number of significant canonical correlations
indicates the number of common factors underlying both sets of
variables.

Canonical correlational [...] independent factors which [...]
out of the second set of variables are likewise independent. The
association between two sets of variables is due to the influence of 
factors common to both sets of variables. 
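
The equation solved for the canonical correlations is not written out in the text. A standard form, given here as an assumption about the computation rather than as the authors' own notation, partitions the 39 x 39 intercorrelation matrix R into the MMPI block R_11 (10 x 10), the SVIB block R_22 (29 x 29), and the cross-correlation block R_12 (10 x 29, the entries of Table 1 transposed):

```latex
R = \begin{pmatrix} R_{11} & R_{12} \\ R_{21} & R_{22} \end{pmatrix},
\qquad
\left| R_{22}^{-1} R_{21} R_{11}^{-1} R_{12} - \lambda I \right| = 0,
\qquad
r_{c_i} = \sqrt{\lambda_i}, \quad i = 1, \dots, 10 .
```

Each nonzero latent root \lambda_i is a squared canonical correlation, and the associated latent vector supplies the weights of the corresponding canonical variate.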

Certain factors may influence both sets of variables while other 
factors may be specific to one or the other set of variables. If no 


TABLE 1
Intercorrelations between the SVIB and MMPI (N = 182)

                                          MMPI
                                          Hypochon-                              Psychopathic Masculinity-
SVIB                                      driasis      Depression   Hysteria     Deviate      Femininity
 1. Occupational Therapist                 .12          .04         -.08         -.08         -.12
 2. Laboratory Technician                  .02          .02         -.04         -.01          .05
 3. Housewife                             -.07          .02         -.07         -.13         -.14
 4. Stenographer-Secretary                -.16*        -.04         -.08         -.06         -.09
 5. Physician                              .09          .04          .08          .05          .16*
 6. Social Worker                          .13         -.01          .21**        .04         -.06
 7. Artist                                 .10          .14          .05         -.01         -.01
 8. Author                                 .12          .15*         .10          .03          .04
 9. Business Education Teacher            -.14         -.04         -.13         -.01         -.05
10. Buyer                                 -.12          .02         -.16*        -.02         -.04
11. Dentist                                .03          .06         -.09          .00          .10
12. Dietician                             -.03         -.12         -.12         -.04         -.06
13. Elementary Teacher                    -.02          .00         -.06         -.20**       -.24**
14. English Teacher                        .13          .07          .08         -.09         -.09
15. Home Economics Teacher                 .00         -.10         -.08         -.15*        -.18*
16. Life Insurance Saleswoman             -.06         -.19**        .05          .13          .15*
17. Lawyer                                 .00         -.13          .02          .19**        .18*
18. Librarian                              .02          .10          .00         -.02          .05
19. Mathematics & Science Teacher          .00          .04         -.06         -.09         -.02
20. Music Performer                        .09         -.01          .11         -.08         -.16*
21. Music Teacher                          .15*        -.04          .12         -.08         -.17*
22. Nurse                                  .07          .08         -.06         -.19**       -.06
23. Office Worker                         -.16*        -.04         -.13         -.03          .01
24. Physical Education Teacher (College)   .15*        -.08          .06          .01          .10
25. Physical Education Teacher            -.11         -.17*         .01         -.14          .03
26. Psychologist                           .15*         [?]          .06          .08          [?]
27. Social Science Teacher                 .06         -.02         -.08         -.08         -.08
28. Y.W.C.A. Secretary                     .20**       -.05          .09          .04          .08
29. Masculinity-Femininity                 .07          .06          [?]          [?]         -.16*
TABLE 1 (Continued)
Intercorrelations between the SVIB and MMPI (N = 182)

                                          MMPI
                                                       Psychas-                  Manic-       Intro-
SVIB                                      Paranoia     thenia       Schizophrenia Depressive  version
 1. Occupational Therapist                 .19**        .00          .05         -.04          .01
 2. Laboratory Technician                 -.03          .07          .03         -.12          .10*
 3. Housewife                              .12          .04         -.14         -.02          .13
 4. Stenographer-Secretary                -.09          .03         -.13          .08          .08
 5. Physician                             -.08         -.05          .04         -.07         -.02
 6. Social Worker                          .10          .02          .07          [?]         -.28**
 7. Artist                                 .10          .06          .03         -.04          .12
 8. Author                                 .01          .06          .03         -.01          .08
 9. Business Education Teacher            -.06          .00         -.11         -.03          .13
10. Buyer                                 -.09         -.05         -.16*         .07          .08
11. Dentist                                .00          .04          .06         -.09          .20**
12. Dietician                              .04         -.03         -.03         -.01         -.06
13. Elementary Teacher                     [?]          .01         -.13         -.03          .09
14. English Teacher                        [?]          .03          .01         -.01          .09
15. Home Economics Teacher                 [?]          [?]          [?]          [?]          [?]
16. Life Insurance Saleswoman             -.09         -.12          .05          .16*        -.25**
17. Lawyer                                -.16*         [?]          .12          .15*        -.25**
18. Librarian                             -.02         -.08         -.06         -.15*         .19**
19. Mathematics & Science Teacher          [?]          [?]          [?]          [?]          [?]
20. Music Performer                        [?]          [?]          [?]          [?]          [?]
21. Music Teacher                          [?]          [?]          [?]          [?]         -.17*
22. Nurse                                  [?]          [?]          [?]          [?]          [?]
23. Office Worker                         -.09         -.01          [?]          [?]          .12
24. Physical Education Teacher (College)   [?]          [?]          [?]          [?]          [?]
25. Physical Education Teacher             [?]          [?]          [?]          [?]          [?]
26. Psychologist                           [?]          [?]          [?]          [?]          [?]
27. Social Science Teacher                -.01         -.06          [?]          [?]         -.08
28. Y.W.C.A. Secretary                     [?]          [?]          [?]          [?]          [?]
29. Masculinity-Femininity                 .08          .01         -.04         -.02          .01

[?] Entry illegible in the source.
* Significant at the .05 level of significance.
** Significant at the .01 level of significance.
Canonical correlation determines the smallest number of such
common factors between two sets of variables. Although the num- 
ber of factors is likely to be large, i.e., the rank of the intercorrela- 
tion matrix between the two sets of variables which in the present 
case is ten, it is likely a small number of common factors will 
account for a large portion of the association between the two sets 
of variables. The situation is analogous to principal components 
analysis, where usually the first few factors account for the ma- 
jority of the variance of the variables. The major contribution of 
the intercorrelation matrix between the two sets of variables comes 
from the larger canonical correlations. 

Cooley and Lohnes (1962) point out that geometrically the ca- 
nonical correlation can be interpreted as a measure of the extent to 
which people occupy the same relative positions in the test space of 
the first set of variables as they do in the test space of the second 
set of variables. 


Results and Discussion 


Table 1 shows the intercorrelations between the MMPI and 
SVIB variables. The intercorrelations among the MMPI variables 
and the intercorrelations among the SVIB variables are not shown 
since we are primarily concerned with relationships between rather 
than the relationships within the two sets of variates. 

The correlations significant at both the .05 and .01 level are 
indicated in Table 1. Forty-six out of the 290 correlations were
significant at the .05 level of significance, and 18 of these were 
significant at the .01 level. Only 15 correlations would be expected 
to be significant at the .05 level on the basis of chance alone. 
Therefore, on the basis of examining the individual correlations 
between these two sets of variables, it seems reasonable to expect 
some underlying dimensions accounting for the relationships be- 
tween the MMPI and the SVIB. 
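
The chance expectation quoted above follows from simple arithmetic, sketched below. The variable names are illustrative; the counts are those given in the text. Because the 290 correlations are not independent of one another, the expected count is only a rough yardstick.

```python
n_svib, n_mmpi = 29, 10
n_correlations = n_svib * n_mmpi            # 290 cross-correlations in Table 1
expected_by_chance = 0.05 * n_correlations  # 14.5, i.e. about 15 at the .05 level
observed_significant = 46                   # count reported in the text

print(n_correlations, expected_by_chance)
print(observed_significant / expected_by_chance)  # ratio of observed to expected
```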

It is interesting to note that both the MMPI M-F and Si scale 
have a large proportion of the significant correlations with the 
SVIB scales. The MMPI M-F scale has been interpreted by some 
investigators as a measure of interest. Twelve of the 46 significant 
correlations are between the Si scale and various SVIB scales. 
Furthermore, these correlations are the largest of any column of 
the intercorrelation matrix. This seems reasonable when one considers
that these interests can be placed upon a continuum involving
introversion-extroversion. The pattern of significant intercorrelations
between the remaining MMPI scales and SVIB scales
is congruent with theoretical expectations.

Table 2 indicates the latent roots, the corresponding canonical
correlations and their associated tests of significance. The canonical
correlations were tested for significance by a Chi Square approximation
(see Cooley and Lohnes, 1962). The first canonical correlation
of .69 was significant beyond the .0001 level of significance
while the second canonical correlation of .61 was significant beyond
the .025 level. The remaining eight canonical correlations were not
significant at the .05 level. This suggests that there are two ways in
which the MMPI variables and the SVIB variables are related to
each other. There are two factors common to both sets of variables
that underlie the intercorrelations among the two sets of variables.
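
The chi-square approximation can be sketched as follows. The text cites Cooley and Lohnes (1962) without giving the formula, so the multiplier N - .5(p + q + 1) used here is an assumption; it was chosen because it comes close to the first and third chi-square values printed in Table 2. Some texts use N - 1 - .5(p + q + 1) instead. The function names are illustrative.

```python
import math


def bartlett_chi_square(wilks_lambda, n, p, q):
    """Chi-square approximation for Wilks' lambda (multiplier is an assumption)."""
    return -(n - 0.5 * (p + q + 1)) * math.log(wilks_lambda)


def degrees_of_freedom(p, q, roots_removed):
    """Degrees of freedom after the largest `roots_removed` roots are removed."""
    return (p - roots_removed) * (q - roots_removed)


N, P, Q = 182, 10, 29  # persons, MMPI scales, SVIB scales
chi2 = bartlett_chi_square(0.08121, N, P, Q)  # lambda as printed in Table 2
df = degrees_of_freedom(P, Q, 0)
print(round(chi2, 1), df)
```

Converting the statistic to a probability requires the chi-square distribution with the stated degrees of freedom, which is left out of this sketch.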


TABLE 2
Canonical Correlations and Their Associated Tests of Significance

Roots     Largest Root   Canonical                  Chi       Degrees of
Removed   Remaining      Correlation   Lambda       Square    Freedom      Probability
   0       .478796        .692         .08121       406.75      290         <.0001
   1       .371562        .610         .15948       297.89      252         <.025
   2       .260548        .510         .25377       222.16      216         >.05
   3       .249133        .500          --            --        182         >.05
   4       .195390        .442          --            --        150         >.05
   5       .160369        .408          --            --        120         >.05
   6       .121065        .348          --            --         92         >.05
   7       .101885        .319          --            --         66         >.05
   8       .075742        .275          --            --         42         >.05
   9       .066043        .257          --            --         20         >.05


[...] canonical correlations associated with these vectors were not
significant. For purposes of discussion, let us assume that we are
trying to predict interests from personality. The MMPI will then be
our predictor and the SVIB our criterion. From Table 3, it can be
seen that a linear combination of the MMPI, highly negatively
weighted by Si, predicts the score of a linear function of the
SVIB with relatively high positive and negative weights on a 
number of interest scales which seem to reflect an extroversion- 
introversion dimension. 

The MMPI canonical variate, then, seems to reflect introversion 
while the positive pole of the associated SVIB canonical variate 
seems to reflect occupations that involve components of extrover- 
sion (e.g., Physician, Lawyer, Social Worker, Life Insurance Sales- 
woman scales, etc.), and the negative pole has relatively high 
loadings of professions that could be considered to involve com- 
ponents of introversion (e.g., Business Education Teacher, Dentist, 
Author scales, etc.). Basically, the MMPI factor reflects introver- 
sion while the SVIB factor reflects occupations, which for the most 
part could be ordered on an introversion-extroversion dimension. 
When looking at the individual correlations of the 17 SVIB scales 
with loadings of .20 or more on the first canonical variate for the 
SVIB, we find that in 14 out of 17 instances, their correlation 
with the MMPI Si scale was in the expected direction. These two 
factors correlate .69. This indicates that one underlying dimension 
common to both the MMPI and SVIB seems to be an introversion- 
extroversion dimension. 

The second pair of canonical variates is more complex and hence, 
more difficult to interpret. In any event, the MMPI canonical 
variate is bipolar with relatively high weights for Si and Pd on 
the positive pole and relatively high weights for D and Hs on the 
negative pole. This bipolar factor is interpreted as reflecting re- 
belliousness (nonconformity) and introversion versus a tendency 
toward somatic overconcern. It is difficult to make inferences con- 
cerning the correspondence of the positive pole of the MMPI ca- 
nonical variate and the positive pole of the SVIB canonical 
variate. However, the interest scales loading relatively high on 
the positive pole of the SVIB factor concern to some extent the 
satisfying of economic or utilitarian motivations (e.g., Life Insur- 
ance Saleswoman, Dentist, Business Education Teacher). There 
seems again to be a component of introversion here (e.g., Dentist 
and Business Education Teacher). 

On the other hand, the negative pole of this factor seems to
reflect interests involving social service or support as opposed to
economic or utilitarian motivations (e.g., Nurse, Social Worker)
and health or physical body interests (e.g., Nurse, Physical Education
Teacher, Dietician). The negative pole of the SVIB factor
then corresponds fairly well with the negative pole of the MMPI factor
(i.e., bodily and health concerns).

TABLE 3
The Canonical Vectors Associated with the Two Largest Canonical Correlations

                                          First Canonical   Second Canonical
MMPI Scale                                Correlation       Correlation
 1. Hypochondriasis                        -.14              -.52
 2. Depression                              .20              -.75
 3. Hysteria                                .21               .20
 4. Psychopathic Deviate                   -.28               .64
 5. Masculinity-Femininity                  .03               .32
 6. Paranoia                                .15              -.30
 7. Psychasthenia                          -.29               [?]
 8. Schizophrenia                           .42               .41
 9. Manic-Depressive                        .20              -.20
10. Introversion                           -.92               .67

                                          First Canonical   Second Canonical
SVIB Scale                                Correlation       Correlation
 1. Occupational Therapist                 -.18               .26
 2. Laboratory Technician                   .09               [?]
 3. Housewife                              -.09              -.04
 4. Stenographer-Secretary                 -.30              -.06
 5. Physician                               .56               .19
 6. Social Worker                           [?]               [?]
 7. Artist                                 -.15               [?]
 8. Author                                 -.89               [?]
 9. Business Educ. Teacher                 -.79               [?]
10. Buyer                                  -.36              -.78
11. Dentist                                -.55               [?]
12. Dietician                               .32               [?]
13. Elementary Teacher                     -.09               [?]
14. English Teacher                         .08               .00
15. Home Econ. Teacher                      .88               [?]
16. Life Insurance Saleswoman               .20               .48
17. Lawyer                                  .47               .21
18. Librarian                              -.03               [?]
19. Math & Science Teacher                 -.24              -.46
20. Music Performer                         .17               .03
21. Music Teacher                           .28               [?]
22. Nurse                                  -.15               [?]
23. Office Worker                           .57               [?]
24. Phys. Ed. Tchr. (College)               .00               [?]
25. Phys. Ed. Teacher                       .25               [?]
26. Psychologist                           -.33               [?]
27. Social Science Teacher                 -.28               [?]
28. Y.W.C.A. Secretary                     -.20               [?]
29. Masculinity-Femininity                 -.09               .05

[?] Entry illegible in the source.

In general, the results suggest that there are two common fac- 
tors underlying both the MMPI and SVIB, one reflecting 
introversion-extroversion and the other less clear but involving 
partly concern for self-gratification versus concern for others. Furthermore,
these two sets of factors correlate to a much higher
extent than any of the individual correlations across the two test
batteries. This investigation suggests two ways in which person- 
ality can be looked upon as related to interests. 

Another meaningful interpretation of the above analysis is as 
follows. High scores on both of the MMPI canonical variates can 
be interpreted as reflecting deviant MMPI profiles. Similarly, high 
scores on the two SVIB canonical variates are largely determined 
by interests that are somewhat deviant from the commonly accepted
female role (e.g., Physician, Lawyer, Life Insurance Saleswoman,
and Dentist). Since the correlation between each of the
two sets of variates is significant, there is a suggestion that deviant 
personality profiles are related to deviant interest patterns. The 
same interpretation applies to low scores on both of the sets of 
variates, but to a lesser extent. This viewpoint is in accordance
with Berg's (1955) deviation hypothesis which states that the tend- 
ency to make deviant responses in one area of behavior (e.g., 
responses on a personality test) may be related to the presence of 
deviant responses in another area of behavior (e.g., responses on
an interest inventory). Furthermore, Berg (1959) has emphasized 
the unimportance of test item content and probably would argue 
against attaching any labels to the canonical variates as was done 
in the present investigation. 

The results of this investigation suggest that perhaps there is 
more of an overlap between inventoried personality and interest 
traits than has hitherto been supposed. Canonical correlational 
analysis as opposed to factor analysis seems to bring out these 
similarities. It would be worthwhile to examine the relationships 
between the SVIB and other personality instruments developed 
specifically for normal populations in order to see if the same 
relationships hold true. 
It should be pointed out, however, that the sample was a homogeneous
sample of predominantly sophomore female college students
enrolled in an introductory health professions course and
that consequently the relationships between the MMPI and SVIB
found in this investigation may be limited in their generalization
to other populations. It could be that the homogeneity of this
sample might have restricted the size and number of canonical 
correlations between the SVIB and the MMPI. 


Summary 


A canonical correlational analysis was performed between the 
10 MMPI and 29 SVIB scales for a sample of 182 predominantly 
sophomore college students. Two significant canonical correlations 
were found and interpreted. The findings of previous factor ana- 
lytic studies involving both the MMPI and SVIB were discussed 
and some differences between the factor analytic and canonical 
correlation approach to examining relationships among sets of
variables were pointed out.
THE FACTOR STRUCTURE AND CONTENT OF
PERCEPTIONS OF DESIRABLE CHARACTERISTICS
OF TEACHERS¹


FRED N. KERLINGER 
New York University 


What is an effective teacher like? What does she do? Unfor-
tunately, we do not really know. As Mitzel (1960) has said, “More 
than half a century of research effort has not yielded meaningful, 
measurable criteria. . . . No standards exist which are commonly 
agreed upon as the criteria of teacher effectiveness.” Ryans (1960a) 
agrees.

Most educators would say that teachers should be intelligent, 
warm, sympathetic, reliable, efficient, resourceful, flexible, and so 
on through a long list of “good” characteristics (Barr, 1950; Mit- 
zel, 1960; Ryans, 1960b). No single teacher, however, can possess 
all, or even most, of these traits. The list must be narrowed down.
But which traits especially characterize the “good” or effective 
teacher? Would we all agree that teachers must be sensitive, warm, 
intelligent, moral, and conscientious above all other traits? Or 
must teachers be thorough, reliable, sympathetic, and loyal?


1 This study is part of Project No. 5-0330, supported by the Cooperative Re-
search Branch, U. S. Office of Education, Department of Health, Education, 
and Welfare. Professors G. Kowitz, University of Houston, R. Sommerfeld, 
University of North Carolina, and T. Linton, University of Wisconsin at Mil- 
waukee, administered scales in Texas, North Carolina, and Wisconsin. The data 
analysis was accomplished with the cooperation of the personnel of the Com- 
puting Center, Courant Institute of Mathematical Sciences, New York Uni- 
versity. Mr. R. Buhler, Princeton University, was kind enough to alter his gen- 
eral computer program, PSTAT, to accomplish second-order factor analysis 
in the manner described below. Dr. E. Pedhazur, New York University, helped 
to process, program, and run data through the computer. He also read the early 
drafts of this paper. I am grateful to these individuals for their valuable help. 
In an earlier study (Kerlinger, 1966) in which 36 educator
judges sorted a set of 90 adjectives selected for possible relevance 
to teachers and teaching, three persons factors emerged. The items 
of the factor arrays calculated from the factor loadings of those 
judges substantially loaded on Factor A consisted of adjectives 
that seemed to characterize a “progressive” teacher: imaginative, 
insightful, warm, flexible, and so on. Factor B's items, on the 
other hand, epitomized what seemed to be a “traditional” teacher: 
conscientious, moral, efficient, just, self-controlled, among other 
traits. The nature of Factor C was not readily categorized. It con- 
sisted of adjectives like enthusiastic, inquisitive, decisive, purpose- 
ful, and sincere. The main point is that there were three distinct 
factors and thus three kinds of judges, or three different perceptions 
of the “good” teacher. 

This study continues, in an R methodological framework, the 
earlier research. It concentrates on the traits thought to be im- 
portant for teachers to possess; the question of the traits that 
teachers actually do possess is not considered. The basic questions 
are: 


1. What factor or factors underlie perceptions of the desirable 
traits of teachers, and what is the nature of the factor struc- 
ture of such perceptions? 

2. Are the factor structures behind perceptions of desirable traits 
of teachers and the factor arrays associated with the factor 
structures invariant over different samples? 


The first question, of course, is the more important one. It implies 
the number of factors, the structure of the factor space, the content 
of the factors, and the relations among the factors. We also ask, 
in connection with the relations among factors: Are there second- 
order factors, and, if so, what is their nature? The second question 
is methodological; it will be discussed later. 

It was hypothesized in the Q study that two factors would
appropriate most of the common factor variance and that these
factors would be congruent with “progressive” and “traditional”
notions of education and teaching. The same hypothesis is tested
in this study. The reasons for the hypothesis have been given
elsewhere (Kerlinger, 1963; 1966) and will not be elaborated here
except to say that the basic assumed duality of educational attitudes
(Kerlinger and Kaya, 1959; Kerlinger, 1963; 1966) was
thought to be reflected in perceptions of teacher characteristics.
It was also expected that the same factors would emerge in different
samples and that the Q factor arrays of the previous study and
the arrays of the present R study would be similar.

2 A factor array, in R methodology, is simply a list of the tests or items that
have high loadings on a factor. An array, in Q methodology, is the same except
that it is derived more indirectly. (See Stephenson, 1953, pp. 174 ff.)

The content of the factors was expected to reflect the content of 
educational attitudes factors, the progressivism attitude factor be- 
ing reflected in one trait factor, and the traditionalism attitude 
factor being reflected in another trait factor. Furthermore, since 
Ryans’ Xo, Yo, and Zo patterns (1960a, pp. 102.) seemed to be 
reflected in the Q arrays of the earlier study, it was expected that 
they would also be reflected in the trait arrays of this study. 


Method 
The Scale: Teacher Characteristics Scale I (TC-I) 


A 38-item summated-rating type scale, TC-I, was constructed
from the 90 items of a Q sort used in the study mentioned earlier 
(Kerlinger, 1966). The 90 items had been selected from a pool of 
350 to 400 traits originally culled from the Allport-Odbert (1936) 
list of some 18,000 traits on the basis of their relevance to the 
teaching function and from Barr's (1950) and Charters and 
Waples’ (1929) lists of traits. The criteria used for the selection 
of the Q-sort items were: apparent validity, as indicated by a 
factor analysis of the responses of 38 judges to a Q sort made up 
of the Charters and Waples noun-traits; representative sampling 
of the teacher trait domain; applicability to the general teaching 
situation, but with particular emphasis on elementary and secondary
levels; relative lack of ambiguity; nonrepetitiveness;
behavioral-operational relevance. In addition, all adjectives with
negative connotations (e.g., careless, lazy) or that obviously “imply”
a good teacher (e.g., effective, constructive) were eliminated.

In the Q study, factor arrays were constructed using those judges 
whose loadings were .40 or greater, these loadings appearing on 
one factor only. The arrays of Factors A and B produced 22 items 
that were used in TC-I. Items lower in array values were added to 
bring the number to 14 in each of A and B. Ten items with middle
values—4, 5, and 6, on the 11-point Q scale—were added, as buffer 
items, making the item total 38. These items were mixed at random 
in TC-I. 

The instructions told the subjects that they were judges who 
should use the traits to describe the “good” teacher. They were 
further instructed to use the criterion “how important it is for 
teachers to have the traits” and to be general in their judgments 
but when in doubt to “think of the public school teacher.” The use 
of the whole scale of numbers from one through seven was empha- 
sized to counteract the tendency to use only the higher (more 
favorable) numbers. 


Samples 


TC-I was administered to five samples, each consisting of teachers,
or graduate students of education, or both: (1) New York
(N = 181), (2) New York (N = 313), (3) North Carolina (N =
404), (4) Texas (N = 480), and (5) Wisconsin (N = 218). The
second New York, the North Carolina, and the Texas samples
together form the basic sample of the study (N = 1197) because
each of its components was sufficiently large to warrant independent
factor analysis. Except to report basic statistics (Table 1), the
first New York sample and the Wisconsin sample are not considered
in this report.


Analysis 


Means, Standard Deviations, and Reliabilities. The means,
standard deviations, correlations between the A and B total scores,
and internal consistency reliability estimates of the five samples
are reported in Table 1. Three estimates of reliability, odd-even,
average-r, and alpha, the generalized variance estimate (Guilford,
1954, pp. 377-386; Cronbach, 1951), were calculated. The estimates
of the three methods were alike: the largest difference was
.038. The alpha estimates are given in the table. Evidently the A
and B subscales have substantial reliability.

[...] successful scales, TC-VIII, will be described in another report.


TABLE 1
Means, Standard Deviations, Reliability Coefficients, and Correlations
between Factor Scales: TC-I and TC-VIII

                           A                         B
Sample(a)    N       M      s     r_tt(b)     M      s     r_tt(b)    r_AB
N. Y.       131     5.37   .76    .84        5.45   .86    .86        .27
N. Y.       313     5.53   .72    .82        5.12   .82    .83        .40
N. C.       404     5.28   .69    .80        5.45   .68    .79        .51
Tex.        480     5.21   .74    .83        5.59   .69    .80        .61
Wisc.       218     5.26   .68    .77        [?]    [?]    .84        [?]
Lp          298     5.07   .87    .80        5.22   .84    .82        .98
Ind.        159     4.94   .69    .69        [?]    [?]    [?]        .23

(a) The first five lines are TC-I statistics; the last two lines are TC-VIII statistics.
(b) Alpha reliability coefficients.
[?] Entry illegible in the source.

The means, standard deviations, and reliabilities of the five sam- 
ples are quite similar. It was expected that the A trait means 
would be higher than the B trait means because of the presumably 
higher social desirability values of the A adjectives (see Table 5 
for examples of the A and B adjectives). The tabled means, how- 
ever, do not show the expected discrepancy. Evidently B traits 
are equally and highly valued on the average. This is probably a 
reflection of the notion that teachers should possess all desirable 
traits; they are, or should be, personified virtue. 

The reliabilities, too, are surprising: they are higher than ex- 
pected. To obtain reliabilities in the .80’s for judgments of single 
adjectives—and note that the A and B reliabilities were calculated 
separately, each on the basis of 14 items—is worthy of special 
notice. The correlations between the A and B subscales, all posi- 
tive and some of them substantial, were not as surprising, even 
though relative independence was expected because of the factor 
mode of item selection. 

Factor analysis. The responses to the 38 A, B, and N items of 
the second New York, the North Carolina, and the Texas samples 
were intercorrelated and factor analyzed separately, using first- 
and second-order factor analysis. The purposes of the first-order
factor analysis were to test factorial invariance and to study the 
structure and content of TC-I in the usual way. The first-order 
analysis, in other words, would help to answer the first question 
asked earlier. The purposes of the second-order analysis were to 
study the structure underlying the first-order factors and the rela- 
tions among the factors and to determine, if possible, the nature 
of the second-order factors. The second-order analysis, then, 
should, if successful, enrich the answer to the first question. 

In the first-order analysis, the principal axes method, with R²
as estimated communalities (Harman, 1960, Ch. 9 and p. 89) and
Varimax rotations (Kaiser, 1958), was used. Four factors were 
rotated in each sample on the basis of eigenvalues greater than 
1.00, Humphreys’ rule (Fruchter, 1954, pp. 79-80), and informed 
judgments of the “correct” orthogonal solution. 
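The extraction step described above can be sketched as follows. This is a minimal illustration under stated assumptions: squared multiple correlations (R²) are placed in the diagonal once, without iteration; the eigenvalue-greater-than-1.00 count is taken on the correlation matrix with unities in the diagonal, which is the usual convention but is not spelled out in the text; and the six-variable correlation matrix and the function name `principal_axes` are hypothetical.

```python
import numpy as np

def principal_axes(R, n_factors):
    """One-pass principal axes factoring with SMCs as communality estimates."""
    R = np.asarray(R, dtype=float)
    smc = 1.0 - 1.0 / np.diag(np.linalg.inv(R))  # squared multiple correlations
    reduced = R.copy()
    np.fill_diagonal(reduced, smc)               # reduced correlation matrix
    eigvals, eigvecs = np.linalg.eigh(reduced)
    order = np.argsort(eigvals)[::-1]            # largest roots first
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    roots = np.clip(eigvals[:n_factors], 0.0, None)
    loadings = eigvecs[:, :n_factors] * np.sqrt(roots)
    return smc, eigvals, loadings

# Hypothetical correlation matrix: two clusters of three variables each.
pattern = np.array([[.7, 0], [.7, 0], [.7, 0], [0, .7], [0, .7], [0, .7]])
R = pattern @ pattern.T
np.fill_diagonal(R, 1.0)

smc, reduced_roots, loadings = principal_axes(R, n_factors=2)
n_retained = int(np.sum(np.linalg.eigvalsh(R) > 1.0))  # eigenvalue > 1.00 rule
print(n_retained, np.round(smc, 3))
```

The unrotated loadings returned here would still need a Varimax rotation before interpretation; that step is omitted from this sketch.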

To test factorial invariance and to determine the legitimacy of 
combining the samples, the factor vectors of the three solutions 
were compared using the coefficient of congruence (Harman, 1960,
p. 257). Because the first-order factor structures of the three sam-
ples seemed virtually the same, and since the means and standard
deviations were also quite similar, the data of the three samples 
were combined to form one large sample of 1197 subjects. This 
was done to minimize error variance and sample specificity. Again, 
four factors were extracted and rotated orthogonally. 
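The coefficient of congruence used for these comparisons is the sum of cross-products of two loading vectors divided by the square root of the product of their sums of squares. A minimal sketch follows; the two loading vectors are hypothetical and are not taken from Tables A, B, or C.

```python
import numpy as np

def congruence(x, y):
    """Coefficient of congruence between two factor loading vectors."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    return float(x @ y / np.sqrt((x @ x) * (y @ y)))

# Hypothetical loadings of the same five items on one factor in two samples.
sample_1 = [.70, .64, .61, .12, .05]
sample_2 = [.66, .60, .58, .20, .10]

phi = congruence(sample_1, sample_2)
print(round(phi, 3))
```

Unlike a product-moment correlation, the coefficient is computed on raw loadings without centering, so it is sensitive to the level of the loadings as well as to their pattern.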

The second-order factor analyses were more complicated. Using 
principal axes factor analysis, with iterated approximations to the 
communalities and oblique Proequamax rotations (Hendrickson 
and White, 1964; Saunders, personal communication), 4, 6, 7, and 
8 factors were extracted and rotated to simple structure. The agree- 
ment among the rotated solutions was visually checked row by 
row. 

The intercorrelations among the oblique primary factors were 
calculated (Thurstone, 1947, Ch. XVIII), and the R matrices fac-
tor analyzed with the principal axes method. In each case two


+ Following a recommendation of Thurstone (1947, pp. 367-369), the scores
[illegible] 313) [illegible] were normalized before factoring. The [illegible]
same as those obtained wi[illegible]
all analyses reported are those of the raw [illegible]
second-order factors were extracted and rotated orthogonally to
simple structure (as nearly as possible). 


Results 


First-Order Factor Analysis 


The coefficients of congruence calculated between pairs of factor 
vectors of the first-order four-factor solutions of the three samples 
are reported in Table 2. Accepting .90 or greater as very good 
agreement, it is clear that, with only one coefficient less than .90, 
factorial invariance seems well-established. 


TABLE 2 


Coefficients of Congruence between Rotated Factor Vectors, 
Three Samples: New York, North Carolina, Texas 


                  I     II    III   IV
N. Y.-N. C.      .96   .96   .96   .90
N. Y.-Tex.       .93   .97   .98   .88
N. C.-Tex.       .96   .94   .98   .91


The correlation matrix and the unrotated and rotated factor 
matrices of the combined (N = 1197) sample are given in Tables 
A, B, and C.5 The R matrix is characterized by positive and sig-
nificant correlations (average r = .21) and a relative absence of
near-zero and negative correlations. This has an important bear- 
ing on all other analyses, as will be seen. Perhaps most important, 
it means that factor separation and differentiation will be difficult, 
perhaps doubtful. 

The eigenvalues, 8.51, 2.56, 1.84, and 1.00, indicate that three, 
four, or perhaps more factors are probably present. If we label the 
factors by the predominance of kind of items loaded on them (A 
or B), two of the four factors rotated were A factors, one was a 
B factor, and one was indeterminate (it had only two significant 
loadings on it). Of the 14 items originally categorized as A, all but 


5 Tables A, B, and C have been deposited with the American Documentation 
Institute. Order Document No. 9412 from ADI Auxiliary Publications Project, 
Photoduplication Service, Library of Congress, Washington, D. C. 20540. Remit 
in advance $1.75 for microfilm or $2.50 for photo-copies, and make checks pay- 
able to: Chief, Photoduplication Service, Library of Congress. 
two were loaded significantly (> .35) on factors with predomi-
nantly A loadings. Of the 14 B items, again all but two were loaded
significantly on the one B factor. Among the 28 A and B items, 
only three loaded on "opposite" factors—on a factor with B items, 
if an A item, and on a factor with A items if a B item—and these 
three items were also loaded on their “own” factors. 

From this evidence, then, the Q method of item selection seems 
to be fairly efficacious. That it leaves much to be desired, however, 
is apparent from the loadings of the 10 N, or presumably “neutral,” 
items. It will be recalled that these 10 items had middle values— 
4, 5, and 6 on an 11-point scale—on the Q factor arrays. But in 
the present study, all of them loaded significantly on one or more 
factors. 


Second-Order Factor Analysis 


With the exception of the second-order solutions of four first-
order factors, the second-order analysis was not successful. To de-
fine second-order factors, of course, a sufficient number of first-
order factors is required. A major difficulty, however, was that the
second-order solutions using more than four first-order factors did 
not agree with each other, even though row-by-row comparison of 
the obliquely rotated matrices of the three solutions showed that 
the same general factor structure was present in the three sets of 
data. Therefore, only the relatively clearcut four-factor results are
presented.

The correlations among the four oblique primary factor vectors 
of the three samples are given in Table 3. In each case, Factor I 
had both A and B item loadings, Factor II had large B loadings, 
while III and IV had mainly A loadings (except in the Texas 
sample where IV’s loadings were about equally divided between 
A and B). The pattern is the same in the three matrices: low 
positive correlations, with Factors I and II, on the one hand, and 
III and IV, on the other hand, clustering together, and Factor IV 
also sharing variance with Factor I. The rotated matrices of the 


Agreement between the New York and North Carolina eight-factor oblique
and second-order solutions was good. Between the Texas and the other two
solutions, however, agreement was only fair. While the three samples might
have been merged to wash out error and sample factor idiosyncrasy, it was de-
cided to be conservative and to treat the oblique first-order and the second-
[...]ly.
TABLE 3


Correlations among Primary Factors, TC-I, Four-Factor 
Solutions, Three Samples* 


           N. Y.            N. C.            Tex.
       II   III   IV     II   III   IV     II   III   IV
I      24   18    28     35   22    27     25   21    31
II          09    10          17    29          20    25
III               48                32                38


*Decimal points are omitted. See text for descriptions of I, II, III, and IV.


factor analyses of these three R matrices clarify the picture and
show the substantial agreement among the three sets of data.
More important, they show that the A and B factors do separate 
in factor space, despite the positive correlations among the factors 
and the lack of clear separation of clusters of correlations. The 
rotated matrices are given in Table 4. The dual factor structure 
and factorial invariance are seen clearly. The A and B factor 
separation is only sharp, however, in the New York matrix. The 
other two matrices show more the effect of the positive correlations 
among most of the items of TC-I, even between the A and B items. 
If the reader will take the trouble to plot the matrices of Table 
4 on two axes, A and B, he will see both the underlying similarity 
and the differences among the three solutions. 
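The second-order step can be illustrated with the New York correlations among primary factors from Table 3. The sketch below rests on simplifying assumptions that differ from the analysis reported here: unities are left in the diagonal rather than estimated communalities, and a generic SVD-based Varimax routine stands in for the rotation "to simple structure." It therefore should not be expected to reproduce Table 4; the function name `varimax` and its parameters are hypothetical.

```python
import numpy as np

def varimax(L, n_iter=100, tol=1e-8):
    """Orthogonal Varimax rotation of a loading matrix (SVD-based iteration)."""
    p, k = L.shape
    T = np.eye(k)
    criterion = 0.0
    for _ in range(n_iter):
        LT = L @ T
        target = LT ** 3 - LT @ np.diag((LT ** 2).sum(axis=0)) / p
        u, s, vt = np.linalg.svd(L.T @ target)
        T = u @ vt                      # T stays orthogonal at every step
        if s.sum() < criterion * (1 + tol):
            break
        criterion = s.sum()
    return L @ T

# New York correlations among primary Factors I-IV (Table 3).
R_ny = np.array([
    [1.00, 0.24, 0.18, 0.28],
    [0.24, 1.00, 0.09, 0.10],
    [0.18, 0.09, 1.00, 0.48],
    [0.28, 0.10, 0.48, 1.00],
])

eigvals, eigvecs = np.linalg.eigh(R_ny)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]
unrotated = eigvecs[:, :2] * np.sqrt(eigvals[:2])   # two second-order factors
rotated = varimax(unrotated)
print(np.round(rotated, 2))
```

Because the rotation is orthogonal, each first-order factor keeps the same communality before and after rotation; only the distribution of its variance across the two second-order axes changes.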


TABLE 4 


Second-Order Rotated Matrices, Four-Factor Solutions, 
TC-I, Three Samples* 


          N. Y.,         N. C.,         Tex.,
         N = 313        N = 404        N = 480
         A     B        A     B        A     B
I       .28   .48      .24   .54      .20   .61
II      .04   .47      .18   .67      .19   .43
III     .68   .09      .65   .15      .69   .20
IV      .66   .22      .47   .33      .50   .89


*Loadings > .35 are considered significant.


Interpretation of First-Order Factors 


The factor arrays of the four factors of the combined sample 
analysis are given in Table 5. Perusal of the three arrays (the 
fourth factor arrays have been omitted; see Footnote a in the 
TABLE 5

Factor Arrays, TC-I, Three Samples Combined, N = 1197^a


            A                             B                              C

.70 Friendly (A) (.53)        .68 Efficient (B) (.60)          .67 Imaginative (A) (.61)
.64 Kind (A)                  .60 Punctual (B) (.61)           .56 Insightful (A) (.34)
.61 Cheerful (A)              .59 Thorough (B) (.55)           .55 Flexible (A) (.38)
.59 Pleasant (A)              .57 Industrious (B) (.56)        .53 Original (N) (.65)
.57 Polite (A)                .53 Conscientious (B) (.54)      .46 Sensitive^b (A) (.46)
.52 Considerate (A)           .51 Reliable (B) (.56)           .41 Tolerant (A) (.55)
.49 Sympathetic (A) (.64)     .46 Sensible (N) (.58)           .41 Warm^b (A) (.65)
.51 Warm (A) (.65)            .44 Firm (B) (.54)               .39 Humorous^b (N)
.41 Humorous^b (N)            .43 Healthy (B)                  .39 Thoughtful^b (N)
.87 Thoughtful (N)            .43 Learned (B) (.38)            .38 Alert^b (A)
.41 Religious^b (B)           .42 Poised (N) (.41)             .35 Purposeful^b (A)
.36 Moral^b (B)               .41 Progressive (N)              .35 Open-Minded (A) (.41)
                              .41 Self-Controlled (B) (.37)
                              .45 Moral^b (B)
                              .39 Religious^b (B)
                              .39 Purposeful^b (A)


[...]ings are those obtained from the analysis of the combined Long Island and Indiana sample (N = 457).

^b These items were loaded on two factors.


table) yield distinctly different impressions. Four psychologists of 
recognized competence in the study of teacher characteristics were 
asked to judge the arrays. Although different words were used, the
judgments in essence agreed with each other: “Person-Oriented,” 
“Affective Merit,” “Humane,” and “Positive Social Reinforcement” 
were the expressions used to describe Factor A. The factor, then, is 
named “Positive Person Orientation.” 

The judges’ categorizations of Factor B were: “Responsibility- 
Orientation,” “Managerial Merit,” “Systematic-Orderly,” and 
“Organization for Task Accomplishment.” The factor is named 
“Systematic Task Organization.” 

Factor C was called by the judges “Divergent Thinking,” “Mo- 
tivational Merit,” “Creative-Surgent,” and “Freedom from Func- 
tional Fixity.” These notions seem to be expressed by the name 


FRED N. KERLINGER 653 


“Functional Flexibility.” Factor D was not named because it had 
only two items on it. 

Two of the four judges said that the three factors were like 
Ryans’ Xo, Yo, and Zo teacher characteristic patterns. (One of 
the judges who did not mention the Ryans’ patterns was Ryans 
himself.) 


A Confirmatory Study 


To obtain confirmatory evidence of the factors found with TC-I, 
as well as to construct a relatively short instrument to use in other 
research, another scale, TC-VIII, was constructed.8 It was a
summated-rating scale of 22 items, 11 A and 11 B, the items being 
selected on the basis of the item-total r's and the factor loadings 
of the TC-I data. The B items were all on Factor B in the TC-I 
analysis. The A items, however, were selected from Factors A and 
C because (1) the C items were considered essential to confirm the 
TC-I results, (2) they were considered important in the judgment
of desirable teacher traits, and (3) the TC-I evidence showed that 
A and C items were factorially similar (in two-factor solutions, 
they loaded on the same factor). 

TC-VIII was administered, with other scales, to 298 teachers in 
New York (Long Island) and to 159 graduate students of educa- 
tion and teachers in Indiana. The basic statistics are given in 
Table 1 (last two lines). They are quite similar to those of TC-I, 
except for the reliability of A in the Indiana sample. Each item
was correlated with its respective total (A or B). All items had 
item-total r’s greater than .45 in the Long Island sample and, with 
one exception, greater than .40 in the Indiana sample. 

The data of the two samples were factor analyzed using the 
principal axes method and Varimax rotations. Because the statis- 
tics and factor structures of the two samples were alike, the samples 
were combined (N = 457) and the resulting data factored. Two 
and three factors were extracted and rotated, the former to see if 
the A and B items would all be loaded significantly on separate 


8Actually, three different scales were constructed and used. The results from 
these scales are not directly pertinent to the present study, however. Keyed 
copies of TC-I and TC-VIII are available on request. 
factors and the latter to see if the three factors, A, B, and C, of the
TC-I analysis would emerge. 

In the two-factor analysis, all the A items were significantly 
loaded on one factor and not on the other, and all the B items 
were loaded on the other factor and not on the factor with the A 
items. The dual basis of the trait perceptions thus receives further
confirmation. In the three-factor analysis, all the B items were
loaded on one factor, while the A items were loaded on two factors.
With two exceptions—Tolerant and Sensitive, which were loaded
on Factor C in the TC-I analysis but on Factor A in this analysis 
—all the A items loaded as they had in the earlier analysis. 

Discussion 

The results of this study and the Q study that preceded it indi- 
cate that the old question, What are the desirable traits of teach- 
ers? cannot be answered in the form in which it is put. We should 
ask, rather, What traits of teachers do different sets of individuals
believe are desirable in teachers? There would seem to be at least
three bases of judgment corresponding to the factors described 
earlier. We might ask about a teacher’s orientation to people, her 
task organization or orientation, or her functional flexibility. To 
ask a judge, then, to tell what an effective teacher is like requires, 
for an understandable answer, knowledge of the judge’s basic edu-
cational orientation and knowledge of the underlying criteria (fac-
tors) he is using in making the judgments.

If the sample of traits used in the study was adequate, then, it
can be tentatively said that there are three principal factors under-
lying perceptions of desirable traits of teachers.9 These three fac-
tors, moreover, resemble Ryans’ Xo, Yo, and Zo patterns. The
resemblance was apparent in the Q data, but was more marked in 
the present data. Observations of teacher behavior and perceptions 
of traits seem to approach each other through the underlying fac-
tors of both.


9 This statement is not meant to rule out larger numbers of factors. The
three principal factors can of course break down into correlative or comple-
[illegible] factors. A and C, for example, are correlative factors, as indicated
by the extraction and rotation of only two factors and by the second-order
[...]


The positive correlations in the R matrix and the resulting positive
correlations between factors are worth special note. This characteristic
of perceptions of desirable teacher traits, together with
the substantial variances of the individual items (the standard 
deviations range from 1.0 to about 1.6), yields the relatively high 
reliabilities of the A and B subscales of TC-I and TC-VIII due, 
no doubt, to the variance summation principle (Cronbach, 1951).
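The effect of summing positively correlated items can be made concrete with the standardized form of alpha, which depends only on the number of items and their average intercorrelation. The sketch below is illustrative: it assumes, for lack of a subscale figure, that the average r of .21 reported for the whole 38-item matrix also describes a 14-item subscale, and the function name is hypothetical.

```python
def standardized_alpha(k, mean_r):
    """Reliability of a k-item sum when items have average intercorrelation mean_r."""
    return k * mean_r / (1 + (k - 1) * mean_r)

# 14 items, using the all-item average r of .21 as a stand-in for the subscale value.
alpha_14 = standardized_alpha(14, 0.21)
print(round(alpha_14, 2))  # prints 0.79
```

Even a modest average intercorrelation thus yields a reliability close to the .80's reported in Table 1 once 14 items are summed.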

Nothing whatever has been said, in this report, about the actual 
traits teachers possess or about their actual behavior in classrooms.
Most discussions of teacher traits do not clearly distinguish be- 
tween the traits teachers do possess and the traits it is believed 
“good” teachers should possess. And the distinction is important. 
Do administrators and board of education members hire and re- 
ward teachers on the basis of traits they are actually observed to 
possess (assuming such observation to be possible)? Or do they 
and laymen react to teachers on the basis of what they believe are 
the traits necessary for effective teaching? In any attempt to deter- 
mine effectiveness of teachers and teaching, a crucial question must 
be asked: how is effectiveness judged—and by whom? Any criteria 
of effectiveness are obviously affected by the criterion setter's edu- 
cational goals. In the present study and in the Q study, the judges’ 
perceptions of the traits of the effective teacher, the perceptions 
they start with, have been shown to differ systematically. In other 
words, the criterion problem is made much more complex and diffi- 
cult by the known presence of different criterion and criterion- 
setter dimensions in different sets of people. 

The questions asked earlier in this report have been answered. 
Many other questions, however, are suggested by the results of 
the research. Why do the A and B traits appear on different fac- 
tors? Do individuals with different philosophies of education “see” 
the effective teacher differently? Do teachers who believe that 
teachers should be primarily warm, kind, and sympathetic teach 
differently than teachers who believe that teachers should be pri- 
marily reliable, efficient, and conscientious? Do perceptions of the 


possible, the separation and invariance of factors over different samples 
“good” teacher influence administrator and board of education de-
cisions? Is there any direct connection between the actual posses-
sion of traits and teachers’ perceptions of the traits? Such questions 
suitably conclude this report. 
USE OF THE CHILDREN'S PERSONALITY
QUESTIONNAIRE IN DIFFERENTIATING
BETWEEN NORMAL AND DISTURBED CHILDREN¹,²


ELISE E. LESSING AND ALBERT D. SMOUSE³


Institute for Juvenile Research 
Chicago, Illinois 


Previous work with the Sixteen Personality Factor Question- 
naire (Cattell and Eber, 1964) and with the High School Per- 
sonality Questionnaire (Cattell and Beloff, 1962) has shown that 
various psychopathological syndrome groups present characteristic 
profiles which distinguish them from the normal population. While 
the CPQ was designed to extend the age range of the 16 PF and 
the HSPQ downward, there have been few empirical studies to 
determine whether the CPQ offers comparable discrimination 
among various criterion groups. In the CPQ manual, the instruc- 
tions for the computation of the Neuroticism Score are followed by 
the statement: “At the present, this can be used simply to rank 
individuals on a neuroticism score, based largely on results for
adults, and it must be left to further clinical research to check this 
and find a standardization with cutting levels indicative of need 
for clinical attention” (Porter and Cattell, 1960, p. 47). Karson’s 
(1965) study of boys with conduct and personality problems, while 
not directly concerned with the issue of instrument validity, contains
data which raise questions about the discriminating power of
some of the CPQ factors comprising the Neuroticism Score.


1 A preliminary version of this paper was presented on May 6, 1966, at the
annual meeting of the Midwestern Psychological Association at Chicago,
Illinois.

2 The authors wish to express appreciation to Jack Becktel for statistical
consultation and to Aldona Vaitkus for computational assistance. We are
indebted to Irving D. Harris for making psychiatric ratings and to Mary Anne
Rynerson for helpful comments regarding the manuscript. The generous co-
operation of Clarence H. Pygman, Superintendent (now retired) of the May-
wood, Illinois Public Schools is also gratefully acknowledged.

3 Now at the University of Oklahoma, Norman, Oklahoma.

Consequently, the purposes of the present study were: (a) to 
determine how effectively the CPQ differentiates between “normal” 
and “emotionally disturbed” children, and (b) to evaluate the fac- 
tor composition and discriminating power of the CPQ Neuroticism 
Score. 


Method 
Subjects 


The sample of “emotionally disturbed” children consisted of 110 
children from 10 years 0 months of age through 12 years 11 months 
of age who received a psychiatric examination at the Institute for 
Juvenile Research from September, 1963, through June, 1965. 
These children manifested a wide variety of presenting symptoms, 
with learning difficulties, aggressiveness toward peers, excessive de- 
mands for attention, general unhappiness, and disobedience being 
noted most frequently. Each child was tested individually, usually
immediately after his or her interview with the psychiatrist. About 
a dozen cases were discarded because the child was unable to under- 
stand the items or would not cooperate in the testing. Two cases 
were discarded because the children had IQ scores below 70. 

The sample of “normal” children consisted of 117 children who 
attended a public school which, it was previously determined, had
a student body similar to the Institute's clinic population in racial
composition and socioeconomic status. The fifth- and sixth-graders 
in this school were tested in classroom groups. Later, protocols of 
children outside the acceptable age range were discarded. The cases 
of two children with unusually high IQ scores were also discarded 
so that the range of IQ scores would be similar in the normal and 
the clinic samples. The normal children were informed that they 
were participating in a research project and that their answers 
would never become a part of their school records. They were 
urged to be completely frank. Since it was not feasible to do any 
psychiatric screening of the school children, the school sample is,
strictly speaking, a non-clinical rather than a “normal” group. In 
this regard it is, however, comparable to the standardization group 
of the CPQ, which likewise consisted mainly of classroom groups. 
The two samples were comparable in regard to mean age, mean
IQ, and distribution on sex, race, and socioeconomic level: t and
χ² tests for the significance of differences on these five variables all
yielded p values exceeding .05. The school sample had a mean age 
of 137.6 months and a mean IQ of 105.2. There were 64 boys and 
52 girls; 92 white children and 25 Negroes; and 37 children whose 
fathers were in white-collar occupations, 69 children of working- 
class origin, and 11 whose father’s occupation was unknown. The 
clinic sample had a mean age of 135.3 months and a mean IQ of 
101.9. There were 72 boys and 38 girls; 78 white children and 32 
Negroes; and 39 children whose fathers were in white-collar oc-
cupations, 55 children of working-class origin, and 16 whose fa-
ther's occupation was unknown. While the difference between the
mean IQ scores of the two samples does not reach statistical sig- 
nificance, it is possible that the normal children were somewhat 
brighter than the clinic children. The IQ scores for the school 
sample were derived from the Henmon-Nelson Tests of Mental 
Ability (HNTMA) administered by the classroom teachers as a 
group test, while the clinic children were nearly all tested indi-
vidually, usually with the Wechsler Intelligence Scale for Children
(WISC).
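The comparability checks on the categorical variables can be re-derived from the counts printed above. The sketch below computes an uncorrected Pearson chi square for the sex and race distributions; whether the original tests used a continuity correction is not stated, so that is an assumption. The counts are used as printed (the school sex counts sum to 116 rather than 117), and the function name is hypothetical.

```python
def chi_square_2x2(table):
    """Pearson chi square (no continuity correction) for a 2 x 2 frequency table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    return stat

CRITICAL_05_DF1 = 3.841  # chi square needed for p < .05 with 1 df

# Rows: school sample, clinic sample (counts as printed in the text).
sex_stat = chi_square_2x2([[64, 52], [72, 38]])    # boys, girls
race_stat = chi_square_2x2([[92, 25], [78, 32]])   # white, Negro

print(round(sex_stat, 2), round(race_stat, 2))
print(sex_stat < CRITICAL_05_DF1, race_stat < CRITICAL_05_DF1)
```

Both statistics fall below the .05 critical value, which agrees with the statement that the p values exceeded .05.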


Procedures 


Both the school and the clinic samples were given Form A of 
the CPQ, 1959 edition. Time limitations precluded the ideal pro- 
cedure of administering both forms. The raw scores were converted 
to stave scores by means of Table 8 of the CPQ Handbook 
(1960). The second-order factors of Anxiety-vs-Adjustment and 
Extraversion-vs-Introversion as well as the Neuroticism Score 
were computed according to the instructions in the Handbook. The 
Index of Idiosyncrasy was also computed for each case in ac-
cordance with the formula described by Pierson and Kelly (1903,
p. 442). 

Since the diagnostic profile revealed on the CPQ would be ex- 
pected to differ according to the psychopathology of the child, 
clinical ratings and questionnaire scores were utilized to separate 
the clinic sample into broad diagnostic groups. Peterson (1961) 
indicated that several independent factor analytic studies of chil-
dren's behavior have produced two major groupings: conduct prob-
lem, involving the acting out of unsocialized aggression, and
personality problem, involving excessive inhibition, anxiety, and 
the internalization of conflicts. Peterson's Problem Checklist was 
administered to the mothers of the clinic children. The children 
were assigned to the problem category on which they received the 
highest factor score, provided that the mother’s selection of the 
“three most important problems" was consistent with this classifi- 
cation. Children with equal scores on both factors and children 
whose major problems were not reflected in the highest factor score 
were classified as “other.” In addition, a psychiatrist with over ten 
years of clinical experience and the senior author independently 
utilized the application material in the case record to classify each 
child as having primarily a personality problem, primarily a con-
duct problem, or some other type of disorder. The final classifica-
tion for each case was the category for which there was agreement
by two out of three rating sources. This level of agreement was
obtained on 96 per cent of the cases. The four cases on which
there was no agreement were added to the “other” category. 
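The two-out-of-three rule described above amounts to a majority vote with a fallback category. A minimal sketch follows; the category labels and the function name are illustrative choices, not terms fixed by the study.

```python
from collections import Counter

def final_classification(checklist, psychiatrist, senior_author):
    """Return the category chosen by at least two of the three rating sources."""
    label, votes = Counter([checklist, psychiatrist, senior_author]).most_common(1)[0]
    return label if votes >= 2 else "other"

# Illustrative cases.
print(final_classification("personality", "personality", "conduct"))  # personality
print(final_classification("conduct", "other", "personality"))        # other
```

A case on which all three sources disagree falls into "other," mirroring the treatment of the four unresolved cases.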


Statistical Analysis 


All data processing was done on an IBM 1620 computer. Three 
series of statistical analyses were performed. First, 18 two-way
analyses of variance were computed in order that the clinic and 
the school samples might be compared on the 14 first-order CPQ 
factors, the two second-order factors (Anxiety and Extraversion- 
Introversion), and the indices of Idiosyncrasy and Neuroticism.
The computations, and all similar ones to be described subse- 
quently, were performed by means of Winer's (1962, p. 241) for- 
mulas for unweighted-means analysis with unequal cell sizes. 
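An unweighted-means analysis treats every cell mean as if it were based on the same number of cases, namely the harmonic mean of the cell sizes, while the error term comes from the pooled within-cell variation. The sketch below follows that general logic for a two-way design; it is an illustration under those assumptions rather than a transcription of Winer's formulas, and the toy scores and function name are hypothetical.

```python
import numpy as np

def unweighted_means_anova(cells):
    """Two-way unweighted-means ANOVA; cells[i][j] holds the scores for A level i, B level j."""
    p, q = len(cells), len(cells[0])
    means = np.array([[np.mean(c) for c in row] for row in cells])
    sizes = np.array([[len(c) for c in row] for row in cells])
    n_h = p * q / np.sum(1.0 / sizes)            # harmonic mean of the cell sizes
    grand = means.mean()
    a_means = means.mean(axis=1)                 # unweighted row means
    b_means = means.mean(axis=0)                 # unweighted column means
    ss_a = n_h * q * np.sum((a_means - grand) ** 2)
    ss_b = n_h * p * np.sum((b_means - grand) ** 2)
    residual = means - a_means[:, None] - b_means[None, :] + grand
    ss_ab = n_h * np.sum(residual ** 2)
    ss_within = sum(
        np.sum((np.asarray(c, dtype=float) - np.mean(c)) ** 2)
        for row in cells for c in row
    )
    df_error = int(sizes.sum() - p * q)
    ms_error = ss_within / df_error
    return {
        "F_A": ss_a / (p - 1) / ms_error,
        "F_B": ss_b / (q - 1) / ms_error,
        "F_AB": ss_ab / ((p - 1) * (q - 1)) / ms_error,
        "df_error": df_error,
    }

# Toy data: rows = group (clinic, school), columns = sex (boys, girls).
toy = [
    [[1, 2, 3], [1, 2, 3, 2]],
    [[4, 5, 6, 5], [4, 5, 6]],
]
result = unweighted_means_anova(toy)
print(result)
```

In the toy data only the group factor differs across cells, so the sex and interaction F ratios are zero while the group F is large.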

Secondly, since a comparison of the heterogeneous clinic group
with the school sample might obscure differences that were 
diagnosis-specific, the CPQ trends attributable to diagnosis were 
explored in two ways. The two major clinic subgroups (children 
with personality problems and children with conduct problems) 
were compared with each other by means of 18 two-way analyses 
of variance with sex and type of disorder as the bases of classifica- 
tion. Each of these two subgroups was also separately compared 
with the normal sample by means of a set of 18 analyses of vari-
ance with sex and group as the independent variables. It was antic-
ipated that the normal-versus-personality problem and normal-
versus-conduct problem comparisons might delineate trends whose 
magnitude was insufficient to appear as statistically significant dif- 
ferences when only the small clinical subgroups were compared 
with each other. The clinic cases in the heterogeneous category of
“other disorders” were not included in these analyses. 

Thirdly, discriminant function analyses were performed in order 
to obtain CPQ factor weights which would provide maximum dis- 
crimination between the clinic and the normal samples. The dis- 
criminant function analyses were performed by means of a three- 
phase computer program based on formulas presented by Rao 
(1962, pp. 257, 318-19) and Kendall (1957, p. 163). In the first 
phase, the Mahalanobis D² statistic V was used as a chi square
(with df equal to the number of factors) to test whether the 14 
CPQ factor means for the clinic group differed significantly from 
the means of the normal group. The discriminant coefficients for 
all 14 factors for each of the two groups were then computed. 
Weights to be used as multipliers for each factor stave score were 
obtained by subtracting the coefficients of the normal group from 
the coefficients for the clinic group. The discriminant score for each 
case is the sum of the 14 weighted factor scores plus a constant 
which places the cut-off score at zero. The step-wise program then 
successively eliminated the single factor contributing least to the 
discrimination between groups and re-computed the chi square 
evaluating the significance of the difference between groups until 
only the single factor contributing most remained. The third phase 
of the program classified all cases in terms of their discriminant 
score and indicated the percentage of correct classifications in 
terms of actual group membership. 
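The first and third phases described above can be sketched for the two-group case. The following is a minimal illustration, not the original three-phase program: it assumes a pooled within-group covariance matrix, takes V as n1·n2/(n1 + n2) times D² (a common large-sample form, assumed here), and uses simulated scores on three hypothetical factors instead of the 14 CPQ factors. The step-wise elimination phase is omitted.

```python
import numpy as np

def two_group_discriminant(x_clinic, x_normal):
    """Linear discriminant weights, constant, Mahalanobis D^2, and V for two groups."""
    x1 = np.asarray(x_clinic, dtype=float)
    x2 = np.asarray(x_normal, dtype=float)
    n1, n2 = len(x1), len(x2)
    m1, m2 = x1.mean(axis=0), x2.mean(axis=0)
    pooled = ((n1 - 1) * np.cov(x1, rowvar=False)
              + (n2 - 1) * np.cov(x2, rowvar=False)) / (n1 + n2 - 2)
    # Clinic coefficients minus normal coefficients: S^-1 m1 - S^-1 m2.
    weights = np.linalg.solve(pooled, m1 - m2)
    constant = -0.5 * weights @ (m1 + m2)   # places the cut-off score at zero
    d2 = float((m1 - m2) @ weights)         # Mahalanobis D^2
    v = n1 * n2 / (n1 + n2) * d2            # referred to chi square, df = number of factors
    return weights, constant, d2, v

# Simulated scores on three hypothetical factors.
rng = np.random.default_rng(1)
normal = rng.normal(loc=3.0, scale=1.0, size=(117, 3))
clinic = rng.normal(loc=[2.6, 3.4, 3.0], scale=1.0, size=(110, 3))

weights, constant, d2, v = two_group_discriminant(clinic, normal)
clinic_scores = clinic @ weights + constant
normal_scores = normal @ weights + constant
hits = np.sum(clinic_scores > 0) + np.sum(normal_scores <= 0)
accuracy = hits / (len(clinic) + len(normal))
print(f"D2 = {d2:.2f}, V = {v:.2f}, correct = {accuracy:.0%}")
```

With the constant chosen this way, the mean discriminant score is +D²/2 for the clinic group and −D²/2 for the normal group, so a positive score classifies a case as clinic.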


Results and Discussion 


Comparison of School and Clinic Samples 


The results of the 18 two-way analyses of variance comparing 
the clinic and the school samples are presented in Table 1. The 
clinic children are less dominant (Factor E), less happy-go-lucky 
(Factor F), more restrained (Factor J), more guilt-prone (Factor 
TABLE 1


Results of Eighteen Two-Way Analyses of Variance Comparing CPQ Scores of Clinic
Versus School Sample and Boys Versus Girls


                                       Mean                          Mean
                                  Clinic    School               Boys      Girls
CPQ factor or index               N = 110   N = 117      F       N = 137   N = 90       F

A   Sociable—Aloof                 2.82      2.89       .01       2.58      3.28     20.15**
B   Intelligent—Dull               2.43      2.64      3.06       2.60      2.44      1.98
C   Ego strength—Weakness          3.15      3.18       .08       3.23      3.07      1.46
D   Excitable—Placid               3.01      3.03       .09       2.84      3.30      9.45**
E   Dominant—Submissive            2.53      2.97     15.11**     2.97      2.43     19.52**
F   Happy-go-lucky—Serious         3.20      3.65      6.33*      3.26      3.69      1.22[illegible]^a
G   Conscientious—Frivolous        2.56      2.55       .06       2.40      2.79      6.53*
H   Venturesome—Shy                2.64      2.77      1.03       2.74      2.64      4[illegible]
I   Sensitive—Tough                3.05      2.76      3.07       2.77      3.11      5.40*
J   Restrained—Vigorous            2.64      2.39      4.56*      2.55      2.44      2[illegible]
N   Shrewd—Naive                   3.37      3.20      2.27       3.20      3.48      5.77*
O   Guilt-prone—Complacent         3.26      2.91      4.51*      3.01      3.19      1.58
Q3  Self-controlled—Lax            3.05      2.92       .17       3.90      2.58     19.67**
Q4  Tense—Relaxed                  3.11      3.08       .36       3.09      3.10      [illegible]
    Anxiety—Adjustment            30.86     30.26      1.55      29.39     32.31     14.20**
    Extraversion—Introversion      8.65      9.31      4.52*      8.58      9.61     15.80**
    Neuroticism                    1.46     -2.21     15.02**    -1.29       .88      6.50*
    Idiosyncrasy                    .86       .82      1.86        .83       .85

Note.—There is one degree of freedom for each F computed.

*p < .05.

**p < .01.


^a The F for the interaction between sex and classification was 4.51 (p < .05).


O), and more introverted (Extraversion-Introversion Factor). The 
clinic children also have a significantly higher mean on the sum- 
mary psychopathology index, the Neuroticism Score. While one 
might have expected significant differences on the Anxiety Factor, 
the negative findings are consistent with the warning by Cattell 
and Scheier (1961, pp. 286-287) that the second-order factor of 
Anxiety reflects situational reactions and constitutional factors as 
well as psychopathology. 


School Sample Versus Clinic Sample Subgroups 


Table 2 presents the results of 18 two-way analyses of variance 
comparing the school sample with the personality problem clinical 
subgroup and 18 analyses comparing the school sample with the 
conduct problem subgroup. It is immediately evident that the chil- 
dren with internalized personality problems were responsible for 
nearly all of the significant differences between the clinic and the 
TABLE 2

Results of Two Sets of Two-Way Analyses of Variance

                                   Personality Problem Group        Conduct Problem Group
                                   (N = 37) vs School Sample        (N = 44) vs School Sample

                                   Meanᵃ       F          F         Meanᵃ       F          F
                                   PP                               CP
CPQ factor or index                Group     Groups      Sex        Group     Groups      Sex

A   Sociable–Aloof                  2.89       .00     12.85**      2.75       .12      9.96**
B   Intelligent–Dull                2.30      3.93*      .21        2.48       .51       .21
C   Ego strength–Weakness           3.14       .06      1.91        3.32      1.22       .08
D   Excitable–Placid                3.00       .02      5.57*       2.86       .25      2.08
E   Dominant–Submissive             2.24     15.96**   14.18**      2.77      2.35      6.78**
F   Happy-go-lucky–Serious          3.41      1.31      4.82*       3.11      2.55      5.78*
G   Conscientious–Frivolous         2.41       .74       .14        2.59      1.96     18.76**
H   Venturesome–Shy                 2.57      1.27      2.06        2.70       .15      1.04
I   Sensitive–Tough                 3.05      1.54      1.73        3.09      2.91      4.91*
J   Restrained–Vigorous             2.84     17.51**    1.76        2.68      1.92      2.05
N   Shrewd–Naive                    3.41      1.04      5.13*       3.41      1.50      1.18
O   Guilt-prone–Complacent          3.51      6.61*      .30        3.16       .96       .56
Q3  Self-controlled–Lax             2.65      1.78     13.99**      3.20      1.81      4.53*
Q4  Tense–Relaxed                   3.16       .99      1.39        3.05       .00       .12
    Anxiety–Adjustment             32.35      8.85*    11.07**     29.70       .23      1.70
    Extraversion–Introversion       8.86      1.57       .51        8.57      1.11      4.449
    Neuroticism                     2.59     12.44**    6.87**       .32      1.92       .28
    Idiosyncrasy                     .87      1.76      1.80         .90      3.90*      .65

* p < .05

** p < .01

ᵃ See Table 1 for the school sample means with which the clinic subsample means in this column were compared.

ᵇ The F for the interaction between sex and group was significant (p < .05).



school samples. Only in regard to Extraversion-Introversion does 
a difference found in the comparison of the school and clinic sam- 
ples fail to emerge in the comparison of the personality problem 
subgroup with the school sample. However, two new significant 
differences emerge: the children with personality problems score 
lower on the Intelligence Factor (B), and higher on the Anxiety 
Factor than the normal children. 

The 44 children with conduct problems differ significantly from 
the normal children only in regard to the Index of Idiosyncrasy, a 
measure of sheer deviation from the mean, regardless of direction. 
In reporting upon a study of 850 delinquent adolescent boys, 
Pierson and Kelly (1963) speculated that the nonconformist im- 
pulses revealed by high scores on the Index of Idiosyncrasy might 
result in acting out that partly serves the purpose of asserting 
individuality. 
Comparison of Personality and Conduct Problem Subgroups 


When the children with personality problems were compared 
directly with the children with conduct problems by means of 18 
two-way analyses of variance, none of the F values for the variable 
of group classification reached statistical significance. Therefore, 
the results of the analyses are not presented in tabular form. 

There was no replication of Karson's (1965) finding that Factors 
A, B, D, and I differentiated between boys with personality prob- 
lems and those with conduct disorders. The failure of replication 
might be partly attributable to differences between the two studies 
in regard to sampling and classification procedures. Karson’s sub- 
groups, unlike those used in the present study, contained no 
Negroes and were classified entirely on the basis of clinical ratings. 
Moreover, since the subgroups of the present study had fewer than 
half as many cases as Karson’s subgroups, the subgroup means 
may be less stable and representative. 


Sex Differences 


The analyses of variance presented in Tables 1 and 2 reveal 
extensive sex differences which are greater in magnitude than the 
differences between groups varying in psychopathological status. 
Since the differences occur even on five of the six factors which 
are scaled separately for boys and girls (D, E, I, N, and Q3), it 
would appear that the stave scores cannot be assumed to control 
for sex differences. 


Neuroticism Score 


As a test of the diagnostic power of the CPQ, the summary 
psychopathology index, the Neuroticism Score, was utilized to clas- 
sify all cases in both the clinic and the school samples. Since there 
was no a priori reason to prefer one type of misclassification over 
another, the cut-off point was chosen in such a way as to equalize 
incorrect classifications of normal and clinic cases. The maximum 
percentage of correct classifications was obtained by using a cut- 
off score of zero, with all cases having scores of 1 or above being 
classified as disturbed. By this procedure, 59.5 per cent of the 
total number of cases are correctly classified, with 61.6 per cent of 
the normal and 57.3 per cent of the clinic cases being correctly 
identified. There is some support for the construct validity of the 
Neuroticism Score in the fact that 64.9 per cent of the clinic 
children with internalized personality (“neurotic”) problems are 
correctly identified, as compared with 52.3 per cent of the children 
with acting-out, conduct problems. The level of discrimination ob- 
tained is, however, inadequate for the purpose of individual diag- 
nosis. 
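
The cut-off rule described above can be illustrated with a short sketch. The scores below are invented for illustration and are not the data of the study, and the function names are hypothetical. Cases scoring above the cut-off are called disturbed, and the cut-off is chosen so that the number of misclassified normal cases and the number of misclassified clinic cases are as nearly equal as possible.

```python
def classify_rates(normal, clinic, cutoff):
    """Return (% normal correct, % clinic correct, % total correct).

    A case is called "disturbed" when its score exceeds the cutoff.
    """
    normal_ok = sum(1 for s in normal if s <= cutoff)
    clinic_ok = sum(1 for s in clinic if s > cutoff)
    total = len(normal) + len(clinic)
    return (100.0 * normal_ok / len(normal),
            100.0 * clinic_ok / len(clinic),
            100.0 * (normal_ok + clinic_ok) / total)


def balanced_cutoff(normal, clinic):
    """Pick the cutoff that most nearly equalizes the two kinds of error."""
    candidates = sorted(set(normal) | set(clinic))

    def imbalance(c):
        false_pos = sum(1 for s in normal if s > c)    # normal called disturbed
        false_neg = sum(1 for s in clinic if s <= c)   # disturbed called normal
        return abs(false_pos - false_neg)

    return min(candidates, key=imbalance)


# Invented scores, for illustration only (not the study's data).
normal_scores = [-4, -3, -2, -1, 0, 0, 1, 2]
clinic_scores = [-2, 0, 1, 1, 2, 3, 4, 5]

cutoff = balanced_cutoff(normal_scores, clinic_scores)
rates = classify_rates(normal_scores, clinic_scores, cutoff)
```

With unequal group sizes one might prefer to equalize error percentages rather than error counts; the sketch uses counts only for simplicity.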


Discriminant Function Analyses 


The preceding analyses of data revealed that sex differences were 
more extensive and statistically significant than the differences be- 
tween psychopathological types. Therefore, the discriminant func- 
tion analyses were set up to differentiate only between normal and 
clinic cases without regard to type of disorder, but with the sexes 
analyzed separately. 

The normal and the clinic boys were found to differ significantly 
on the 14 CPQ factors considered in combination (chi square 29.66; 
p < .001). Discriminant scores based upon the factor weights presented 
in Table 3 were found to classify 62.8 per cent of the boys 


TABLE 3

Weights to be Applied to Stave Scores on the 14 CPQ Factors to Give
Maximum Separation of Normal and Disturbed Children

                                   Weights for Boys        Weights for Girls
                                   Using      Using        Using      Using
                                   All        Key          All        Key
CPQ Factor                         Factors    Factors      Factors    Factors

A   Sociable–Aloof                 -.165                   -.184
B   Intelligent–Dull               -.274                    .182
C   Ego strength–Weakness          -.099                    .091
D   Excitable–Placid               -.169                    .392       .346
E   Dominant–Submissive            -.268                   -.759      -.651
F   Happy-go-lucky–Serious         -.377      -.373        -.173
G   Conscientious–Frivolous         .398                    .393       .224
H   Venturesome–Shy                -.223                   -.113
I   Sensitive–Tough                 .435       .979        -.164
J   Restrained–Vigorous             .253                    .408       .488
N   Shrewd–Naive                    .196                    .481       .481
O   Guilt-prone–Complacent          .223                    .116
Q3  Self-controlled–Lax            -.236                    .575       .497
Q4  Tense–Relaxed                  -.290                    .250

Constant to be added*              2.161       .178       -4.599     -4.404

* Constant will establish zero as cut-off point with positive scores representing disturbance.
correctly, with 63.8 per cent of the clinic boys and 61.5 per cent of 
the normal boys correctly identified. This level of discrimination 
represents no real improvement upon the 60.6 per cent of correct 
classifications obtainable by means of the Neuroticism Score, par- 
ticularly when one considers the expected shrinkage upon cross- 
validation. 
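
As a minimal sketch of how a discriminant score of this kind is applied, the following assumes that each child's stave scores are multiplied by the corresponding weights, summed, and added to a constant, with a positive total read as disturbance. The weights and constant shown are hypothetical values chosen for easy arithmetic, not the values of Table 3, and the function names are likewise illustrative.

```python
def discriminant_score(stave_scores, weights, constant):
    """Weighted sum of factor scores plus a constant.

    Only factors that appear in `weights` contribute to the score.
    """
    return sum(w * stave_scores[f] for f, w in weights.items()) + constant


def classify(stave_scores, weights, constant):
    """Positive scores are read as 'disturbed', all others as 'normal'."""
    score = discriminant_score(stave_scores, weights, constant)
    return "disturbed" if score > 0 else "normal"


# Hypothetical two-factor weights and constant (NOT the values of Table 3).
example_weights = {"F": -0.5, "I": 1.0}
example_constant = -1.5

child_a = {"F": 4, "I": 2}   # -2.0 + 2.0 - 1.5 = -1.5
child_b = {"F": 2, "I": 4}   # -1.0 + 4.0 - 1.5 =  1.5

label_a = classify(child_a, example_weights, example_constant)
label_b = classify(child_b, example_weights, example_constant)
```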

Successive elimination of the least discriminating factor re- 
vealed that Factors F and I, used as a pair with the special 
weights noted in Table 3, provide the same percentage of correct 
discriminations as the entire set of 14 CPQ scores. The single 
factor contributing most to the discrimination between groups was 
Factor I (Sensitive-Tough). The clinic boys earned the same mean 
score as the clinic and normal girls on this factor, while the normal 
boys earned scores indicative of a tougher, more masculine tem- 
perament. 

The clinic and the normal girls were also found to differ significantly 
from each other in regard to mean scores on the 14 CPQ 
factors considered together (chi square 24.56; p < .01). Discrimi- 
nant scores produced by the weights presented in Table 3 classify 
70.0 per cent of the girls correctly with 68.4 per cent of the dis- 
turbed girls and 71.1 per cent of the normal girls correctly identi- 
fied. Again, the level of discrimination is somewhat poor for the 
purpose of individual diagnosis. However, the improvement upon 
the Neuroticism Score (which classifies 57.7 per cent of the girls 
correctly) is sufficiently great that not all of the gain would be 
expected to disappear upon cross-validation. 

Successive eliminations of the least discriminating factors revealed 
that Factors D, E, G, J, N, and Q3, used with the special 
weights noted in Table 3, provide as high a percentage of correct 
discriminations as the entire set of 14 weighted CPQ scores. The 
single factor contributing most to the discrimination was E 
(Dominance-Submission) with the disturbed girls being consider- 
ably less dominant. 


Components of Neuroticism 


The results of the discriminant function analyses suggest that 
the CPQ Neuroticism Score has more redundancy than is required 
for the practical purpose of discriminating between normal and 
pathological groups. More important, the findings of the present 
study, in conjunction with those of Karson (1965), raise the question 
as to whether the Neuroticism Score (derived mainly by analogy 
from studies of adults) adequately reflects the factor components 
of neuroticism in children. Low dominance (Factor E) and high 
sensitivity (Factor I) emerged as significant aspects of neuroticism 
in both the earlier and the present study (though the significance 
of Factor I was limited to boys). However, while adult neurotics 
are characterized by low ego strength (Factor C), both Karson 
and the present authors found that their samples of children pre- 
senting a phenotypically neurotic symptom picture (“personality 
problem”) were average on Factor C. Likewise, both Karson and 
the present authors found that the children with internalized per- 
sonality problems were neither significantly more tense (Factor 
Q4) nor less venturesome (Factor H) than normal children, although 
adult neurotics are discriminated from normals on these 
factors. These findings are consistent with Langford’s (1964, p. 3) 
observations that disturbed children, in contrast to adults, have a 
greater tendency to release tension by putting feelings into action; 
symptoms representing firmly fixed pathology in an adult may be 
transitory reactions in a child without there being any residue of 
permanent ego damage. 

The findings in regard to the other factors comprising the Neu- 
roticism Score are inconsistent across studies. For example, Kar- 
son’s sample of boys with personality problems earned a sten score 
well within the normal range on Factor O (Guilt-Proneness), 
while both the total clinic sample and the personality problem 
subgroup of the present study had a significantly elevated score on 
Guilt-Proneness. Clearly, additional data are needed on other sam- 
ples of emotionally disturbed children in order to separate sample 
fluctuations from stable, replicable trends. Since a single form of 
the CPQ may provide only minimal diagnostic, discriminating 
power, there is also need for comparative validation studies 
utilizing Forms A and B combined. 


CPQ Standardization Group as Norm Group 

Since information was unavailable regarding the racial, socio- 
economic, and intellectual characteristics of the CPQ standardiza- 
tion group, the present study utilized a school and a clinic sample 
which were known to be comparable on these potentially influential 
variables. However, it is important to know whether it would have 
been feasible to dispense with the control group and simply com- 
pare the stave score factor means of the clinic sample with the 
stave (standard) score factor means of 3.0 arbitrarily set for the 
standardization group. The school sample was compared with the 
standardization group on all 14 CPQ factors by means of t tests 
which treated the standardization group mean as the population 
parameter. The school sample was found to be less intelligent (Fac- 
tor B), more happy-go-lucky (Factor F), less conscientious (Fac- 
tor G), less venturesome (Factor H), tougher (Factor I), more 
vigorous (Factor J), and more shrewd (Factor N) than the 
standardization group. If the clinic sample had been compared with 
the CPQ standardization group, differences attributable to emo- 
tional disturbance would have been inextricably confounded with 
differences (e.g., being less conscientious) shared with normal children 
having the same racial, intellectual, and socioeconomic characteristics. 
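
The comparison with the standardization group amounts to a one-sample t test in which the fixed value 3.0 stands in for the population mean. A minimal sketch follows; the scores are invented for illustration and the function name is hypothetical.

```python
import math
import statistics


def one_sample_t(sample, population_mean):
    """t statistic and degrees of freedom for H0: mean == population_mean."""
    n = len(sample)
    mean = statistics.mean(sample)
    sd = statistics.stdev(sample)          # sample SD (n - 1 in the denominator)
    t = (mean - population_mean) / (sd / math.sqrt(n))
    return t, n - 1


# Invented stave scores on one factor, for illustration only.
scores = [2, 3, 3, 4, 4, 4, 5, 3, 2, 4]
t, df = one_sample_t(scores, 3.0)
```

The resulting t would then be referred to a t table with n - 1 degrees of freedom.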

The school sample had 65.5 per cent children of blue-collar 
workers as compared with 58.1 per cent for the Chicago metropoli- 
tan area and 21.4 per cent Negroes as compared with 16.9 per cent 
for the Chicago metropolitan area (U. S. Bureau of Census, 1962). 
It seems highly likely that the school sample differs from the un- 
known distribution of occupations and races characterizing the 
standardization group in the same way that it differs from the 
Chicago area population. Since previous research with the 16 PF 
has already established distinctive profiles for different occupa- 
tional levels (Cattell and Eber, 1964), further research might re- 
veal that the CPQ is similarly sensitive to socio-cultural differ- 
ences. Meanwhile, it would seem advisable that demographic 
variables be controlled in research on the CPQ. 


Summary 


Form A of the Children’s Personality Questionnaire was ad- 
ministered to 110 child guidance clinic patients and 117 fifth- and 
sixth-graders enrolled in a public school with a student body similar 
to the clinic population in racial composition and socioeconomic 
level. Two-way analyses of variance computed with sex and group 
as the independent variables indicated that the clinic children 
were less dominant, less happy-go-lucky, more restrained, more 
guilt-prone, more introverted, and more neurotic than the normal 
children. Both the CPQ Neuroticism Score and the discriminant 
scores resulting from discriminant function analyses differentiated 
only moderately well between the normal and the disturbed children. 
Other findings of the study indicate the need for further 
investigation of the CPQ factor correlates of neuroticism in chil- 
dren and of possible class differences in response to the CPQ. 
A METHOD FOR ASSESSING RACIAL ATTITUDES IN 
PRESCHOOL CHILDREN¹ 


JOHN E. WILLIAMS AND J. KAREN ROBERSON 
Wake Forest University 


The concept of racial attitude is closely identified with certain 
traditional measurement procedures; namely, questionnaire scales 
of the Thurstone, Likert, and Guttman varieties. Research with 
such scales has led to most of the existing knowledge concerning 
racial attitudes among older children and adults. Unfortunately, 
questionnaire scales are not appropriate to the test-taking capabili- 
ties of young children and, thus, are not applicable to the study of 
the development of racial attitudes in the preschool years. This 
limitation has led to the evolution of special research procedures 
for the assessment of racial attitudes and concepts among pre- 
school children, and many imaginative approaches have been made 
(for example: Ammons' (1950) doll-play technique; Morland's 
(1962) picture-interview procedures; Stevenson and Stewart's 
(1958) figure discrimination, doll assembly, and incomplete 
stories; Goodman’s (1964) puzzle-interview, pictures and clay in- 
terview, etc.). While the use of such procedures has led to many 
interesting findings, it has been difficult to assess the degree to 
which these procedures tap the same psychological processes 
which are assessed among older persons by the traditional attitude 
scales. In other words, a need has existed for a procedure appropriate 
to the skills of the preschool child which would yield a 
measure of attitude which could be coordinated with the tradi- 
tional concept of racial attitude. 

¹ This study was supported in part by a grant from the Wake Forest College 
Graduate Council. The authors are indebted to the administrators and teachers 
of the participating schools for their cooperation, and to Beverly Burch 
for her assistance in the data collection. 




The rationale of the method to be described here had its origins 
in the finding of Osgood and his associates (1957, p. 193) that 
evaluation (E) scores from the semantic differential were highly 
correlated with scores on traditional attitude tests. For example, 
semantic differential E scores for the word Negro were found to 
correlate above .80 with scores obtained on Thurstone’s scale of 
attitude toward the Negro. Since the test-retest reliabilities of the 
measures were .87 in both cases, Osgood and his associates con- 
cluded (1957, p. 194), “. . . whatever the Thurstone scales meas- 
ure, the evaluative factor of the semantic differential measures just 
about as well.” After they had considered data from another 
study in which E scores were found to correlate highly with 
scores on a Guttman type scale, these investigators concluded, 
“The findings of both of these studies support the notion that the 
evaluative factor is an index of attitude” (pp. 194-195). This con- 
clusion suggested that the general rationale of the semantic differ- 
ential might provide a bridge to the measurement of attitudes in 
young children. Such an approach via the assessment of evaluative 
connotations seemed feasible, since many of Osgood’s evaluative 
adjectives were simple words which would be found in the vo- 
cabulary of the preschool child. 

The rationale just described was first employed in a study of 
Caucasian preschool children (Renninger and Williams, 1966) 
aimed at assessing the evaluative connotations of the colors black 
and white.² For this study, the investigators chose four pairs of 
adjectives (good-bad, clean-dirty, happy-sad, nice-awful) which 
had been shown to be heavily weighted on the Evaluation factor 
among adult Ss (Osgood et al., 1957, p. 37), and which were considered 
likely to be meaningful to children of this age. In the research 
procedure, S was shown a picture of two animals, identical 
in all respects except that one was colored black and the other 
white, and was told a story in which one of the animals was de- 
scribed by one of the eight evaluative words with S being asked to 
indicate which animal was so described, e.g., which is the good 


² This study is one in a series dealing with . . . color names, the practice of 
"color-coding" racial groups, and the possible influence of this practice in the 
development and maintenance of racial attitudes: Williams (1964, 1966), Harbin 
and Williams (1966), Williams and Carter. 
horse? The three to five year old Ss used the positive and negative 
adjectives in a highly consistent manner responding to good, 
happy, nice, and clean by indicating the white animal, and re- 
sponding to bad, sad, naughty, and dirty by indicating the black 
animal. This finding that positive words tended to cluster about 
the white concept, while negative words tended to cluster about 
the black concept, was similar to earlier semantic differential find- 
ings with young adult Ss (Williams, 1964), and was taken as evi- 
dence that a general evaluation factor, so prominent in the re- 
sponses of adults, was also present in the responses of preschool 
children. It seemed reasonable to conclude that the procedure of 
this study was tapping the child's evaluative predispositions or at- 
titudes toward the colors black and white. 

The present study was undertaken to extend the methodology 
and findings of the study just described. In particular, the investi- 
gation was aimed at: lengthening the evaluation factor procedure 
in an effort to improve its reliability; and exploring the usefulness 
of this procedure in the assessment of racial attitudes. 


Method 


Subjects 


Subjects were 111 Caucasian children attending one of three 
preschool programs (two church-sponsored, one private) in Win- 
ston-Salem, N. C. Judged informally, the children in all three 
programs appeared to be from better than average socio-economic 
backgrounds. 

Subjects ranged in age from three years, three months, to six 
years, nine months and were grouped by age, as follows: Group I 
consisted of 21 girls and 16 boys, ranging in age from 35-59 
months, with a median age of four years, five months; Group II 
consisted of 23 girls and 14 boys, ranging from 60-68 months in 
age, with a median of five years, four months; Group III con- 
sisted of 21 girls and 16 boys, ranging in age from 69-81 months, 
with a median of six years, zero months. 


Apparatus 


Two sets of materials were used in the study. The first was a 
modification of the Renninger and Williams (1966) picture series 
for the assessment of the connotative meanings of black and white. 
The second set was a new picture series devised to provide meas- 
ures of: (a) attitude toward dark-skinned (“Negro”) or light- 
skinned (“Caucasian”) persons; (b) awareness of sex-role behav- 
iors (a control measure); and (c) ability to designate racial groups 
by common racial names. 

Revised Color-Meaning Picture Series. The original version of 
this procedure (Renninger and Williams, 1966) provided S with 
eight opportunities to complete stories involving eight evaluative 
adjectives, four positive and four negative, by indicating whether 
a white or a black animal was the one so described. Eight non- 
evaluative stories with pictures in colors other than black and 
white were alternated in the series. The primary purpose of the 
revision was to increase the number of test opportunities from 
eight to twelve in order to decrease the probability of an S making 
a high score by chance. A secondary purpose was to standardize 
size of picture card and background color, which had varied unsystematically 
in the original version. The latter was accomplished 
by making all picture cards of the same size (11 in. × 14 in.) and 
by using background colors of blue or yellow. 
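
The effect of lengthening the test on chance scores can be sketched with the binomial distribution. The calculation below assumes, for illustration only, a child who chooses between the two figures at random (probability .5 on each item, items independent); the thresholds shown are examples and are not scoring criteria stated in the text.

```python
from math import comb


def prob_at_least(k, n, p=0.5):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))


# Chance of a perfect score under random responding (assumed p = .5).
p_8_of_8 = prob_at_least(8, 8)        # 1/256
p_12_of_12 = prob_at_least(12, 12)    # 1/4096

# Chance of missing at most one item under random responding.
p_7_of_8 = prob_at_least(7, 8)        # 9/256
p_11_of_12 = prob_at_least(11, 12)    # 13/4096
```

Under these assumptions a near-perfect score is roughly ten times less likely by chance on the twelve-item form than on the eight-item form.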

Table 1 summarizes key features of the revised color-meaning 
procedure. The first eight pictures listed were those used in the 
original version; the last four were the new pictures added in the 
revision. In the table, the alternation of filler and test pictures is 
seen, as is the left-right alternation of the black and white figures 
on the six test cards. On the right, are given the key questions 
asked of S after he was told a two or three sentence story to provide 
context. It will be seen that two story questions are given for each 
of the twelve pictures. This is due to the fact that the administra- 
tion procedure consisted of displaying all of the cards to S, and 
then displaying the cards again, in the same order, with different 
stories and, for the six test pictures, with an adjective of reversed 
evaluative meaning. The S thus had a total of twelve test opportunities 
to use the white and black figures to indicate positive or 
negative evaluation. 

The evaluative adjectives used are indicated in italics in the key 
questions of the test cards in Table 1. The positive adjectives 
clean, nice and good, and the negative adjectives dirty, naughty, 
TABLE 1

Summary of Color Meaning Procedure, Suborder 1, Describing Picture Content
and Key Questions for Test (T) and Filler (F) Items

Picture
Number     Picture Content                                              Story Number and Key Question
and Type   (Left)              (Right)             Background           ("Which one . . .")

 1  (F)    Brown plane         Green plane         yellow                1. ...got caught up in the tree?
                                                                        13. ...would fly the fastest?
 2  (T)    White dog           Black dog           yellow                2. ...is the clean doggy?
                                                                        14. ...is the bad doggy?
 3  (F)    Brown phone         Red phone           yellow                3. ...of these would you like to have?
                                                                        15. ...would your mommy choose?
 4  (T)    Black Teddy Bear    White Teddy Bear    yellow                4. ...is the dirty teddy bear?
                                                                        16. ...is the nice teddy bear?
 5  (F)    Red Wagon           Blue Wagon          yellow                5. ...of these will go the fastest?
                                                                        17. ...would be safe for the baby?
 6  (T)    Black Kitten        White Kitten        blue                  6. ...is the pretty kitty?
                                                                        18. ...is the naughty kitty?
 7  (F)    Green Top           Orange Top          blue                  7. ...is the broken top?
                                                                        19. ...will spin the fastest?
 8  (T)    White Rabbit        Black Rabbit        blue                  8. ...is the mean rabbit?
                                                                        20. ...is the smart rabbit?
 9  (F)    Brown Butterfly     Gray Butterfly      blue                  9. ...was caught by the girl?
                                                                        21. ...did Susie see in the garden?
10  (T)    White Horse         Black Horse         blue                 10. ...is the good horse?
                                                                        22. ...is the ugly horse?
11  (F)    Green Skooter       Red Skooter         blue                 11. ...do you think Billy will choose?
                                                                        23. ...would you pick?
12  (T)    Black Cow           White Cow           yellow               12. ...is the stupid cow?
                                                                        24. ...is the kind cow?


and bad had been used in the original version. To these were 
added the positive adjectives pretty, smart, and kind, and the 
negative adjectives ugly, stupid, and mean. The adjectives happy 
and sad from the original test were dropped in the revision. 

The order of pictures displayed in Table 1, referred to as suborder 
CM1, was that administered to approximately half of the Ss. 
A second order, CM2, was administered to the other Ss and con- 
sisted of a reversal of the first order, except that this second order 
also began with a filler picture. Suborder CM2, thus, consisted of 
the stories in the order 11, 12, 9, 10, 7, 8, etc., 23, 24, 21, 22, 19, 20, etc. 
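
The story order of suborder CM2 can be generated mechanically. The sketch below assumes that the "etc." in the text means the pattern continues unchanged, so that each (filler, test) pair is kept intact while the pairs are taken in reverse, first for stories 1-12 and then for stories 13-24. The function name is hypothetical.

```python
def reversed_pairs(first, last):
    """Story numbers first..last, taken as (filler, test) pairs in reverse order."""
    order = []
    for start in range(last - 1, first - 1, -2):
        order.extend([start, start + 1])
    return order


# First pass through the cards (stories 1-12), then the second pass (13-24).
cm2_order = reversed_pairs(1, 12) + reversed_pairs(13, 24)
```

Keeping the pairs intact is what allows the reversed order to begin with a filler picture, as the text requires.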

Racial-Attitude Sex-Role Picture Series. This procedure, devel- 
oped specifically for this study, was designed along the same gen- 
eral lines as the revised color-meaning procedure. In this proce- 
dure, racial-attitude test cards and evaluative stories occupied even 
TABLE 2

Summary of Racial-Attitude (RA) and Sex-Role (SR) Procedure: Suborder 1

Picture
Number     Picture Content                          Story Number and Key Question              Racial Identification
and Type   (Left)               (Right)             ("Which one . . .")                        Questions

 1  (SR)   Cauc. Girl           Cauc. Boy            1. Which child likes to play with dolls?
                                                    13. Which one likes to play cowboy and Indians?
 2  (RA)   Negro Girl           Cauc. Girl           2. ...is the clean little girl?            1. Which is the white girl?
                                                    14. ...the bad little girl?                 7. Which is the Negro girl?
 3  (SR)   Negro Girl           Negro Boy            3. ...wants to be a policeman?
                                                    15. ...likes to dress up in their mother's clothes.
 4  (RA)   Cauc. Boy            Negro Boy            4. ...is the dirty little boy?             2. ...Negro...?
                                                    16. ...is the nice little boy?              8. ...colored...?
 5  (SR)   Negro Teenage Girl   Negro Teenage Boy    5. ...likes to wear lipstick?
                                                    17. ...likes to play football?
 6  (RA)   Cauc. Teenage Girl   Negro Teenage Girl   6. ...is the pretty girl?                  3. ...colored...?
                                                    18. ...is the naughty girl?                 9. ...white...?
 7  (SR)   Cauc. Teenage Girl   Cauc. Teenage Boy    7. ...works after school mowing lawns, etc.?
                                                    19. ...likes to buy new dresses?
 8  (RA)   Negro Teenage Boy    Cauc. Teenage Boy    8. ...is the mean boy?                     4. ...white...?
                                                    20. ...is the smart boy?                   10. ...Negro...?
 9  (SR)   Cauc. Woman          Cauc. Man            9. ...washes the dishes?
                                                    21. ...built the barn?
10  (RA)   Cauc. Woman          Negro Woman         10. ...is the good woman?                   5. ...Negro...?
                                                    22. ...is the ugly woman?                  11. ...colored...?
11  (SR)   Negro Woman          Negro Man           11. ...can fix a car?
                                                    23. ...cooked the pie?
12  (RA)   Negro Man            Cauc. Man           12. ...is the stupid man?                   6. ...colored...?
                                                    24. ...is the kind man?                    12. ...white...?


numbered positions, while the odd numbered (“filler”) positions 
were occupied by sex-role items. 

Table 2 summarizes the key features of this procedure. Each of 
the twelve stimulus cards consisted of two full-length drawings of 
human figures varying from 4½ to 8 inches tall, displayed side by 
side on a 9 × 11 card. The racial-attitude test cards (even numbers 
in Table 2) displayed two figures which were identical except 
for hair and skin color; one figure (“Caucasian”) had light yellow 
hair and pinkish-tan skin, while the other figure ("Negro") had 
black hair with medium-brown skin.³ The figures were drawn 
with minimal facial characteristics and were posed in "neutral" 
standing, walking, or sitting positions on a plain white background. 
It can be seen on the left in Table 2 that the age level of the persons 
depicted varied from young boys and girls, to teenage boys 
and girls, to adult men and women. 

The six sex-role pictures are odd-numbered in Table 2. Each 
picture consisted of two figures, one male and one female, of the 
same age level. Both figures had the same hair and skin color and 
were depicted with minimal facial characteristics in standing or 
walking postures. 

The twelve pictures in the combined racial-attitude sex-role pro- 
cedure were administered twice, in the same order, but with dif- 
ferent stories told each time for a given picture. The adjectives 
used in the key questions of the racial-attitude test stories were 
the same twelve evaluative adjectives employed in the revised 
color-meaning test. 

The order of pictures displayed in Table 2, referred to as suborder 
RA1, was that administered to approximately half the Ss. A 
second suborder RA2, in which the young to old progression of 
figures was reversed, was administered to the other Ss. This was 
done by administering the stories in the order 9-12, 5-8, 1-4, 21-24, 
17-20, 13-16. 


Procedure 


Experimenters were two female, Caucasian, college students. 
Each E was introduced to the class by the teacher as someone who 
had a game to play with them. The procedure was administered in 
a private room which contained a low table and two small chairs. 
Each of the 7-10 minute initial sessions began with E asking S his 
full name and age, and chatting with the child for a few moments 
as he became accustomed to the situation. E sat directly across the 


³ It should be noted that the figures called Caucasian and Negro were distinguished 
only by hair and skin color with no effort being made to represent 
other racial features. The use of the racial designation is supported by data 
reported below which indicate that Ss easily associated common racial labels 
with the lighter and darker figures. 
TABLE 2


Summary of Racial-Attitude (RA) and Sex-Role (SR) Procedure: Suborder 1


Picture                                    Story Number and Key
Number     Picture Content                 Question                                           Racial Identification
and Type   (Left)          (Right)         ("Which one...")                                   Questions

1  (SR)    Cauc. Girl      Cauc. Boy        1. Which child likes to play with dolls?
                                           13. Which one likes to play cowboy and Indians?
2  (RA)    Negro Girl      Cauc. Girl       2. ...is the clean little girl?                    1. Which is the white girl?
                                           14. ...the bad little girl?                         7. Which is the Negro girl?
3  (SR)    Negro Girl      Negro Boy        3. ...wants to be a policeman?
                                           15. ...likes to dress up in their mother's clothes?
4  (RA)    Cauc. Boy       Negro Boy        4. ...is the dirty little boy?                     2. ...Negro...?
                                           16. ...is the nice little boy?                      8. ...colored...?
5  (SR)    [illegible]     [illegible]      5., 17. [partly illegible: "...play foot..."]
           Girl            Boy
6  (RA)    [illegible]     [illegible]      6. ...is the pretty girl?                          3. ...colored...?
                                           18. ...is the naughty girl?                         9. ...white...?
7  (SR)    Cauc.           Cauc.            7. ...works after school mowing lawns, etc.?
           Teenage Girl    Teenage Boy     19. ...likes to buy new dresses?
8  (RA)    [illegible]     [illegible]      8. ...is the mean boy?                             4. ...white...?
           Boy             Boy             20. ...is the smart boy?                           10. ...Negro...?
9  (SR)    Cauc.           Cauc.            9. ...washes the dishes?
           [illegible]     [illegible]     21. ...built the barn?
10 (RA)    [illegible]     [illegible]     10. ...is the good woman?                           5. ...Negro...?
           Woman           Woman           22. ...is the ugly woman?                          11. [illegible]
11 (SR)    Negro           Negro           11. ...can [illegible] car?
           Woman           [illegible]     23. ...[illegible] the pie?
12 (RA)    [illegible]     [illegible]     12. ...is the stupid man?                           6. [illegible]
                                           24. [illegible]                                    12. [illegible]


numbered positions, while the odd numbered ("filler") positions
were occupied by sex-role items.

Table 2 summarizes the key features of this procedure. Each of
the twelve stimulus cards consisted of two full-length drawings of
human figures varying from 4½ to 8 inches tall, displayed side by
side on a 9 X 11 card. The racial-attitude test cards (even numbers
in Table 2) displayed two figures which were identical except
for hair and skin color; one figure ("Caucasian") had light yellow
hair and pinkish-tan skin, while the other figure ("Negro") had
black hair with medium-brown skin.3 The figures were drawn
with minimal facial characteristics and were posed in “neutral” 
standing, walking, or sitting positions on a plain white background. 
It can be seen on the left in Table 2 that the age level of the per- 
sons depicted varied from young boys and girls, to teenage boys 
and girls, to adult men and women. 

The six sex-role pictures are odd-numbered in Table 2. Each 
picture consisted of two figures, one male and one female, of the 
same age level. Both figures had the same hair and skin color and 
were depicted with minimal facial characteristics in standing or 
walking postures. 

The twelve pictures in the combined racial-attitude sex-role
procedure were administered twice, in the same order, but with
different stories told each time for a given picture. The adjectives
used in the key questions of the racial-attitude test stories were
the same twelve evaluative adjectives employed in the revised
color-meaning test.

The order of pictures displayed in Table 2, referred to as sub- 
order RA1, was that administered to approximately half the Ss. A 
second suborder RA2, in which the young to old progression of 
figures was reversed, was administered to the other Ss. This was 
done by administering the stories in the order 9-12, 5-8, 1-4,
21-24, 17-20, 13-16.


Procedure 


Experimenters were two female, Caucasian, college students. 
Each E was introduced to the class by the teacher as someone who 
had a game to play with them. The procedure was administered in 
a private room which contained a low table and two small chairs. 
Each of the 7-10 minute initial sessions began with E asking S his 
full name and age, and chatting with the child for a few moments 
as he became accustomed to the situation. E sat directly across the 


3It should be noted that the figures called Caucasian and Negro were
distinguished only by hair and skin color with no effort being made to represent
other racial features. The use of the racial designation is supported by data
reported below which indicate that Ss easily associated common racial labels
with the lighter and darker figures.
table from S and held the pictures in a vertical position midway
between herself and S. The mimeographed form on which E re- 
corded S's responses was kept behind the pictures in such a man- 
ner that S could not observe E's actions. 

Each S was tested on two occasions with an interval of one to 
three weeks (median = 14 days) between sessions. Fifty-five Ss
took the color-meaning procedure at the first session and the
racial-attitude sex-role procedure at the second session (Order
CM/RA), while 56 Ss took the racial-attitude sex-role test first and the
color-meaning test second (Order RA/CM). An effort was made
also to counterbalance the CM1 and CM2 suborders of the
color-meaning procedure and the RA1 and RA2 suborders of the
racial-attitude sex-role test across the major CM/RA and RA/CM
orders. While this effort was not completely successful, this does not
seem of consequence since no differential effect was found for
CM1 vs CM2, or RA1 vs RA2.

Color-Meaning Procedure. When the color-meaning procedure 
was administered first, the instructions of Renninger and Williams 
(1966) were used: “What I have here are some pictures I’d like to 
show you and tell you stories about, and I’d like for you to help 
me by finishing every story the way you think it should end. I'll 
show you what I mean." S was then shown the series of 12 picture 
cards in the order prescribed. At the end of each story, the child 
was asked the key question for that story (seen in Table 1) and a 
record was made of which of the two figures he indicated. When 
all twelve of the pictures had been displayed, S was told, “Now 
let's look at the pictures again with some different stories,” and S 
was shown the pictures again in the same order. Thus, S responded 
to each of the 6 test pictures twice, providing 12 responses to the 
black and white figures. When the test was concluded S was asked 
to treat the procedure as a secret and allowed to go back to his 
class. 

When the color-meaning procedure was the test assigned to S 
for the second session, S was told “You remember last time we 
looked at the pictures of people. Well, this time I have something 
different to show you. The pictures I have today are of animals 
and toys and things like that. You know the difference between 
animals and people, don't you? Now, with these pictures I want 
you to help me by finishing every story the way you think it 
should end." E then read the first story and followed the proce-
dure outlined above. After the color-meaning procedure was com- 
pleted, the racial identification procedure (described below) was 
administered, S was reminded to keep things secret, and returned 
to his class. 

Racial-Attitude Sex-Role Procedure. When the racial-attitude 
sex-role procedure was administered in the first session, S was told, 
“What I have here are some pictures I’d like to show you and 
stories that go with each one. I want you to help me by pointing 
to the person in each picture that the story is about. Here, I'll 
show you what I mean.” E then read the first story ending with 
the key question (see Table 2) and recorded S's choice of the two 
figures. This same procedure was followed until each of the 12 
pictures had been displayed. E then said, “Now let's look at the
pictures again, but this time with some different stories.” The
twelve pictures were then displayed again with S’s response
recorded to the 12 different stories and key questions. Thus, in all
S responded to each of the six racial-attitude pictures twice, and to
each of the six sex-role pictures twice.

When the racial-attitude sex-role procedure was administered 
during the second session, S was told, “You remember last time I 
showed you some pictures of animals and toys and things like 
that. Well, this time the pictures I have are very different. The 
pictures I have today are pictures of people. You know the differ- 
ence between animals and people, don’t you? Now, I'll read you a 
story and I want you to help me by pointing to the person in the 
picture that the story is about. Okay?” E then administered the 
procedure as described above. Following this, the racial identifica- 
tion procedure was administered. 

Racial Identification Procedure. A short procedure to assess S's 
tendency to identify lighter and darker persons by common racial 
labels was administered at the end of the second session. This was 
done using the six racial-attitude pictures from the racial-attitude 
sex-role procedure. S was told, “Now let’s look at these pictures 
again. I want to ask you a few questions about the type of people 
in some of these pictures.” S was then shown each of the six pic- 
tures, twice, being asked to indicate the “Negro” person, or the 
“colored” person, or the “white” person in each picture. As indicated
in the right hand column of Table 2, S was given four opportunities
to demonstrate his knowledge of each of the three
racial terms.


Results 


The principal data for analysis consisted of three measures for 
each S: (a) the black-white color meaning score obtained by 
counting the number of times S indicated the black animal when 
a negative adjective was used in the story, plus the number of 
times S indicated the white animal when a positive adjective was 
used; (b) the racial-attitude score obtained by counting the number
of times S indicated the black-haired, brown-skinned person
when a negative adjective was used, plus the number of times S
indicated the yellow-haired, pinkish-tan-skinned figure when a
positive adjective was used; and (c) the sex-role score obtained by
counting the number of times S indicated the male figure when a 
masculine activity was mentioned, plus the number of times S in- 
dicated the female figure when a feminine activity was mentioned. 
Since each of the three measures was based on 12 response oppor- 
tunities, the possible range of scores for each was 0 to 12, with six 
representing a neutral mid-point where S was displaying no con- 
sistency in his responses. 
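
The counting rule for the color-meaning and racial-attitude scores can be sketched as follows. This is a minimal illustration only: the function name, the tuple layout, and the labels "light" and "dark" are hypothetical conveniences and do not come from the original scoring sheets.

```python
from typing import Iterable, Tuple

# One response = (valence of the adjective in the story, figure S pointed to).
# The labels below are illustrative assumptions, not the study's own coding.
Response = Tuple[str, str]


def evaluative_score(responses: Iterable[Response]) -> int:
    """Count responses made in the conventional direction:
    dark figure for a negative adjective, light figure for a positive one."""
    score = 0
    for valence, chosen in responses:
        if valence == "negative" and chosen == "dark":
            score += 1
        elif valence == "positive" and chosen == "light":
            score += 1
    return score


# Hypothetical protocol: 6 positive and 6 negative adjectives, one reversal.
example = (
    [("positive", "light")] * 6
    + [("negative", "dark")] * 5
    + [("negative", "light")]
)
print(evaluative_score(example))  # 11
```

With 12 response opportunities the score runs from 0 to 12, and a subject choosing at random would be expected to score near 6. The sex-role score follows the same pattern with masculine and feminine activities in place of negative and positive adjectives.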

Preliminary analyses to assess the effect of the following variables
yielded negative results: (a) order of presentation (i.e., CM/RA
vs RA/CM); (b) suborder of presentation within the color-meaning
and racial-attitude sex-role procedures; (c) examiner differences;
(d) data collected later in the study vs earlier (to see
if later Ss had been “tipped off” by earlier Ss); and (e) sex of
subject.


Frequency Distributions of Scores 


Table 3 presents frequency distributions of color-meaning, ra- 
cial-attitude, and sex-role scores in each of the three age groups 
along with the frequency which would be expected if Ss were
responding solely by chance, i.e., following the binomial distribution.
For each of the three scores at each of the three age levels,
a χ² test indicated a significant (p < .001) departure of the
obtained distribution from the expected (chance) distribution.
TABLE 3


Distributions of Expected (Chance) and Obtained Color-Meaning, Racial-Attitude,
and Sex-Role Scores for Age Groups I, II, and III of 37 Ss Each


         Expected     Color Meaning      Racial Attitude       Sex Role
Score    Frequency     I    II   III      I    II   III      I    II   III

  0         .01        1     0     0      0     0     0      0     0     0
  1         .11        0     0     0      0     0     0      0     0     0
  2         .74        1     0     0      0     0     0      0     0     0
  3        1.85        0     0     0      0     0     0      0     0     0
  4        4.44        0     0     0      0     0     0      0     0     0
  5        7.03        3     0     1      0     1     0      0     0     0
  6        8.51        3     1     1      2     2     0      1     1     0
  7        7.03        3     2     1      3     0     1      1     0     0
  8        4.44        7     4     1      6     0     1      2     1     0
  9        1.85        3     3     3      4     3     1      3     1     2
 10         .74        4     3     6      5     7     2     10     6     1
 11         .11        3     5     7      9     5     9      7    14     6
 12         .01        9    19    17      8    19    23     13    14    28


An inspection of the distributions of the color-meaning scores 
in Table 3 indicated that Ss appeared to differ primarily in the 
degree to which black was viewed negatively and white posi- 
tively; only two of 111 Ss showed a clear tendency toward a re- 
versal of these meanings. Likewise, the racial-attitude data indi- 
cated differing degrees of negative evaluation of the darker person 
and positive evaluation of the lighter person with no S displaying 
a clear tendency to reverse these meanings. As expected, the sex- 
role scores showed varying degrees of knowledge of appropriate 
sex-role behaviors with no Ss showing a tendency toward a re- 
versal of conventional sex roles. 


Scores Classified by Consistency of Response 


As was just noted, the actual distributions of scores for each of 
the three measures were asymmetrical, falling largely in the upper 
half of the possible 0-12 score range. For this reason, the scores 
were classified for further analysis as follows: scores of 11 and 12 
(binomial chance probabilities of .003 and .0002, respectively) 
were designated high consistency of response; scores of 9 and 10 
(binomial chance probabilities of .05 and .02) were designated 
medium consistency of response; and scores of eight or less
(binomial chance probabilities > .12) were designated as no
consistency of response. Figure 1 presents the per cent of Ss at each
age level showing high and medium response consistency on each
of the three measures. The figure is arranged so that the high and
medium per cents are combined into a total per cent representing
those Ss (scores of 9-12) who displayed a degree of response
consistency not attributable to chance (p < .05).

Figure 1. Per cent of 37 Ss in each of three age groups (I, II, and III)
displaying high (H) and medium (M) response consistency in color-meaning,
racial-attitude, and sex-role scores.
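
The chance probabilities quoted for individual scores, and the three-way classification built on them, can be reproduced with a short sketch. It assumes, as the text does, that a guessing subject picks either figure with probability one half on each of 12 independent trials; the function names are hypothetical.

```python
from math import comb

N_ITEMS = 12  # response opportunities per measure


def chance_probability(score: int, n: int = N_ITEMS) -> float:
    """Binomial probability of exactly `score` conventional responses
    in n trials when each trial is a 50/50 guess."""
    return comb(n, score) / 2 ** n


def consistency_class(score: int) -> str:
    """Classification used in the text: 11-12 high, 9-10 medium, 8 or less none."""
    if score >= 11:
        return "high"
    if score >= 9:
        return "medium"
    return "none"


for s in (8, 9, 10, 11, 12):
    print(s, consistency_class(s), round(chance_probability(s), 4))
```

The printed point probabilities round to the values cited in the text for scores of 9 through 12, and the value for a score of 8 falls just above .12.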

At the left in Figure 1, it can be seen that the total per cent of 
Ss showing a significant degree of consistency in color-meaning 
increased regularly across the age groups from 51.3 of Group I, to 
81.1 of Group II to 89.2 of Group III. Statistical tests revealed that 
the increase from Group I to Group II was statistically significant 
(p < .01), while that from II to III was not.

The per cent of Ss in each age group revealing high and medium
consistency on the racial-attitude measure are shown in the
center portion of Figure 1. Here it can be seen that the total per 
cent increased regularly from 70.2 in Group I to 91.9 in Group II, 
to 94.6 in Group III with the increase from I to II being statistically 
significant (p < .01) while the increase from II to III was not. 
However, a test of the II to III increase in per cent of high con- 
sistency Ss only (from 64.9 to 86.5) was significant (p < .02). 

At the right side of Figure 1 are given the per cents of Ss classi- 
fied as having high and medium sex-role consistency for each of 
the three age groups. Due to the fact that the total consistency per 
cents were approaching a maximum, the differences in the ob- 
served values of 89.1 in Group I, 94.6 in Group II, and 100.0 in 
Group III were not statistically significant. If one considers only 
the high consistency Ss however, the increase from 54.0 in Group 
I to 75.7 in Group II is significant (p < .03), and the increase 
from 75.7 in Group II to 91.9 in Group III is also significant (p < 
.03). 


Relative Development of Concepts 


The data were also examined for evidence concerning the rela- 
tive development of the concepts being studied. In Figure 1, it can 
be seen that in each age group the lowest degree of consistency 
was obtained for color meaning (CM), a higher degree of consistency
on racial attitude (RA), and the highest degree of consistency
on sex role (SR). In Group I: the CM consistent per cent
of 51.3 was significantly lower than the RA consistent per cent of
70.2 (p < .05), and the SR consistent per cent of 89.1 (p < .01);
the RA consistent per cent was also significantly lower than the 
SR consistent per cent (p < .03). In Group II: the difference be- 
tween the CM consistent per cent of 81.1 and the RA consistent 
per cent of 91.9 was of only borderline significance (p < .09), 
while the difference between the CM per cent and the SR per cent 
(94.6) was significant (p < .05) ; the difference between the RA and 
SR per cents was not significant. In Group III, none of the ob- 
served differences in per cents of pooled high-medium consistent 
Ss were statistically significant. However, if one considers only 
the high consistency Ss, the difference between the 64.9 per cent of 
Ss highly consistent on CM and the 86.5 per cent of Ss highly
consistent on RA was significant (p < .02).


Racial Identification Data


The racial identification procedure was designed to determine 
whether the child had learned to apply commonly used racial 
names to designate persons differing in hair and skin color. In the 
procedure, S had four opportunities each to designate the “white” 
person, the “colored” person, and the “Negro” person. Recog- 
nizing that some children might be familiar with the term colored 
and not the term Negro, or vice versa, S was classified as being
"aware" of racial labels if he indicated: (a) the pinkish-tan figure
each of the four times "white" was used, and (b) the brown figure
each of the four times “colored” was used, and/or (c) the brown
figure each of the four times “Negro” was used (i.e., 4+ white,
4+ colored and/or 4+ Negro). Using
this scheme the probability was < .01 for an S who was respond- 
ing by chance to be judged to be aware of the use of racial labels. 
The per cent of Ss in each age group classified as being aware of 
racial labels was 45.9 in Group I, 81.1 in Group II, and 97.3 in 
Group III. Statistical tests based on these percentages indicated 
that the I to II increase was statistically significant (p < .01), as 
was the II to III increase (p < .02).
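
The statement that a guessing S has a probability below .01 of being classified as "aware" can be checked with a short worked calculation. The sketch assumes that each of the 12 identification responses is an independent 50/50 guess; the variable names are illustrative.

```python
from fractions import Fraction

# Probability of pointing to the designated figure on all four
# presentations of one racial term when guessing.
p_all_four = Fraction(1, 2) ** 4  # 1/16

# Criterion (a): all four "white" trials correct.
p_white = p_all_four

# Criteria (b) and/or (c): all four "colored" trials correct, or all four
# "Negro" trials correct, or both (inclusion-exclusion).
p_colored_or_negro = p_all_four + p_all_four - p_all_four * p_all_four

# "Aware" requires (a) together with (b) and/or (c).
p_aware_by_chance = p_white * p_colored_or_negro

print(p_aware_by_chance, float(p_aware_by_chance))
```

Under these assumptions the result is 31/4096, or about .0076, which agrees with the bound given in the text.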


Item Analyses 


It was considered of interest to make an item-by-item examina- 
tion of the color-meaning and racial-attitude procedures to de- 
termine whether the different adjectives selected to represent the 
positive and negative ends of the evaluation dimension did in fact 
do so. Table 4 provides a summary of the responses of all Ss to 
each of the 12 adjectives used in both the color-meaning and
racial-attitude procedures.

From these data, it is clear that the adjectives did fall into two 
discrete groups corresponding to their a priori classification as pos- 
itive or negative. In addition, it can be seen that there was little 
variation among the adjectives within the positive and negative 
groups, indicating that each adjective made an approximately 
equal contribution to the total color-meaning or racial-attitude 
score. The data in Table 4 also reveal that, for eleven of the
twelve adjectives, the degree of association of adjectives with the
figures differing in hair and skin color was more extreme than
was the association of the adjectives with the colors black and
white. This is a further reflection of the finding, noted above, that
the per cent of Ss with high racial-attitude scores exceeded the
per cent of Ss with high color-meaning scores at all age levels. It
can also be seen in Table 4 that Ss appear to have responded to the
adjectives added in the revision of the color-meaning test in much
the same way as to the adjectives from the original version.


TABLE 4


Per Cents of All 111 Ss Responding to Adjectives by Indicating
White or Black Figure on the Color-Meaning Procedure,
and by Indicating Pinkish-Tan or Brown Figure on
the Racial-Attitude Procedure


Evaluative        Color Meaning             Racial Attitude
Adjective      % White    % Black     % Pinkish-Tan    % Brown

Pretty           87.4       12.6           94.6           5.4
Clean*           84.7       15.3           94.6           5.4
Nice*            83.8       16.2           88.3          11.7
Smart            82.0       18.0           79.3          20.7
Good*            81.1       18.9           90.1           9.9
Kind             75.7       24.3           82.9          17.1
Ugly             17.1       82.9            7.2          92.8
Dirty*           16.2       83.8            5.4          94.6
Naughty*         18.0       82.0           13.5          86.5
Stupid           18.0       82.0           17.1          82.9
Bad*             15.3       84.7            9.9          90.1
Mean              9.9       90.1           17.1          82.9


*Adjectives used in color-meaning procedure of Renninger and Williams (1966).


An analysis of Ss’ responses to the individual sex-role items re- 
vealed that eleven of the twelve items were of comparable diffi- 
culty ranging, for the total sample of Ss, from 91.0 per cent to 
95.5 per cent correct, i.e., sex-role appropriate. The one deviant
item, responded to correctly only 64.9 per cent of the time, con- 
cerned after-school work activities of teenagers and was appar- 
ently ambiguous to many Ss. 

An analysis of the racial identification items revealed the fol- 
lowing: the “white” person was indicated as the yellow-haired, 
pinkish-tan skinned figure 96.2 per cent of the time; the “colored” 
person was indicated as the black-haired, brown-skinned figure 
82.4 per cent of the time; the “Negro” person was identified as the 
black-haired, brown-skinned person 69.4 per cent of the time. 
From these data, it would appear that these Caucasian preschoolers
are more familiar with “colored” than with “Negro” as a designation
for dark persons, and that neither of these terms is as well
learned as is the term “white” as a designation for lighter persons.


Discussion


The color-meaning and racial-attitude data indicated that the
attempt to improve the reliability of the basic procedure by the use
of additional evaluative words was successful. The three new pairs 
of evaluative adjectives were responded to in the same fashion as 
the three pairs from the original version. The greater range of 
scores provided by the revised procedure makes possible a more 
sensitive appraisal of the degree of response consistency present in 
a given subject, or, viewed differently, makes it extremely unlikely 
that a subject will be classified as demonstrating response con- 
sistency when he is merely guessing. 

The racial-attitude data suggest that the evaluation factor ra- 
tionale holds promise in providing a racial-attitude measure for 
children which can be coordinated, via the semantic differential, to 
traditional attitude measures in adults. The employment of this 
approach should facilitate the developmental study of attitudes by 
removing the previous discontinuity between child and adult meas- 
ures. One should be cautious, of course, in assuming that the hu- 
man figures of the racial-attitude pictures were necessarily seen as 
different racial groups by the preschool Ss, since the figures dif- 
fered only in skin and hair color with no attempt to represent 
other racial features. On the other hand, the racial identification 
data indicated that, when Ss were asked to identify the “white,” 
“colored,” and “Negro” figures, most were able to do so on the 
basis of the skin- and hair-color cues.

It was surprising to observe the frequency with which these 
predominantly middle-class children obtained high scores on the 
racial-attitude measure; in the oldest age group, 86.5 per cent of 
the children scored in the prejudiced direction 11 or 12 times out 
of 12 opportunities.4 Do these extreme scores signify only consistency
of attitude or do they also reflect attitude intensity? The
conservative answer would be that the higher the score, the more 


4To put this in another, more startling, way: these preschoolers knew their
prejudice about as well as they knew their sex-roles!


consistent the negative attitude toward dark persons since, operationally,
the high scores were obtained by responding to six different
pairs of evaluative adjectives in the prejudiced direction, i.e.,
the brown-skinned figures were not merely “bad,” they were also 
“ugly,” “dirty,” “naughty,” “stupid,” and “mean.” Logically, one 
can conceive of the child who obtains a high score not because of 
any very strong negative feelings about dark-skinned persons, but 
because such persons are viewed consistently as a “little bad,” a 
“little ugly,” a “little dirty,” etc. On the other hand, consistency of 
negative attitude appears often to be used as an index of attitude 
intensity, as in the case of attitude scales which are scored by 
summing the number of different items to which S responds in a 
negative manner. Thus, the consistency vs intensity question re- 
garding the attitude scores of this study seems only to point up 
one of the unresolved problems in attitude measurement. It might 
be possible to illuminate this problem somewhat by modifying the 
procedure of the present study, as follows: after a story containing 
the evaluative words has been told, and after the child has indi- 
cated, for example, which is the “bad man,” the child might be 
asked the further question, “Is he just bad, or is he very bad?” One 
could then determine whether children who display high con- 
sistency of negative attitude also make more frequent use of the 
intensifying adjective. Even if this were found not to be so, the
modified procedure might be useful in yielding a more discriminat- 
ing measure of racial attitude based on both consistency and in- 
tensity of response. 

The data pertaining to the relative development of racial atti- 
tude and color meanings suggest that the two are developing con- 
currently, with the former developing slightly ahead of the latter. 
This result does not support the earlier hypothesis (Renninger and 
Williams, 1966) that the black-white color meanings are learned 
first, and provide a frame of reference for the learning of evalua- 
tive responses to racial groups designated as “black” and “white.” 
The present findings are more consistent with the view that the 
color-meaning factor acts as a contributing or reinforcing factor 
in the early development of racial prejudice. A study in progress 
may provide further clarification of the relation of color meanings 
to racial attitude. In this study, children are being trained, by 
operant learning procedures to reverse the customary connotations 
of black as bad and white as good. Subsequently, the effect of this
training upon the children's evaluations of Caucasian and Negro 
figures will be assessed. 

It would seem that the evaluation factor methodology of this 
study might be applicable to the assessment of other non-racial 
attitudes. Stated most generally, the method might be used to as- 
sess attitude toward any general concept: (a) which can be repre- 
sented pictorially; (b) for which a number of different instances 
(i.e., pictures) can be provided; and (c) about which a meaningful
story containing positive and negative evaluation words can be
told. By constructing appropriate sets of stimulus pictures, one 
could study preschool children's attitudes towards: physically nor- 
mal vs deformed persons; males vs females; uniformed (authority?) 
men vs civilians, etc. Similarly, by keeping the characteristics of 
the human figures constant and by varying background features of 
the pictures, one might study the manner in which indicators of 
socioeconomic class influence the child’s attitude toward persons. 

Future research efforts might also explore the possible value of 
eliminating the “either-or” character of the present procedures by 
presenting S with more than two figures per card from which to 
choose in making his evaluative response. For example, modifying 
the racial-attitude pictures to display one pinkish-tan figure, one 
light-brown figure, and one dark-brown figure might provide evidence
as to whether the negative evaluation of the darker figure
is due to his being “non-Caucasian” (i.e., “not-like-me”), or is more
directly related to the darkness of the figure, per se.


Summary 


This study was aimed at developing a method for the assessment 
of racial attitudes in young children which could be coordinated 
with traditional attitude scales via the rationale of the semantic 
differential. One hundred and eleven Caucasian preschool children 
were administered picture-story procedures to assess: (a) the eval- 
uative connotations of Negro and Caucasian figures; (b) the eval- 
uative connotations of the colors black and white; and (c) the 
child’s knowledge of sex-role behaviors (a control measure). 

It was concluded that the method employed represents a promis- 
ing approach to the measurement of racial attitudes, and that the 
same method might be used in the assessment of non-racial
attitudes in young children.


REFERENCES 


Ammons, R. B. Reactions in a Projective Doll-Play Interview of 
White Males Two- to Six-Years of Age to Differences in Skin 
Color and Facial Features. Journal of Genetic Psychology, 
1950, 76, 323-341. 

Goodman, M. E. Race Awareness in Young Children. New York: 
Collier Books, 1964. 

Harbin, S. P. and Williams, J. E. Conditioning of Color Connota- 
tions. Perceptual and Motor Skills, 1966, 22, 217-218. 

Morland, J. K. Racial Acceptance and Preference of Nursery 
School Children in a Southern City. Merrill-Palmer Quarterly, 
1962, 8, 271-280. 

Osgood, C. E., Suci, G. J., and Tannenbaum, P. H. The Measurement
of Meaning. Urbana: Univer. of Illinois Press, 1957.

Renninger, C. A. and Williams, J. E. Black-White Color Connota- 
tions and Race Awareness in Preschool Children. Perceptual 
and Motor Skills, 1966, 22, 771-785. 

Stevenson, H. W. and Stewart, E. C. A Developmental Study of 
Racial Awareness in Young Children. Child Development, 
1958, 29, 399-409. 

Williams, J. E. Connotations of Color Names Among Negroes and
Caucasians. Perceptual and Motor Skills, 1964, 18, 721-731.

Williams, J. E. Connotations of Racial Concepts and Color Names.
Journal of Personality and Social Psychology, 1966, 3, 531-540.

Williams, J. E. and Carter, D. J. Connotations of Racial Concepts
and Color Names in Germany. Journal of Social Psychology,
1967, 72, 19-26.


EDUCATIONAL AND PSYCHOLOGICAL MEASUREMENT 
1967, 27, 691-697. 


ACQUIESCENCE AND SOCIAL DESIRABILITY 
RESPONSE SETS AND 
SOME PERSONALITY CORRELATES 


DARHL M. PEDERSEN 
Brigham Young University 


The responses of an examinee to a personality measure may involve
response sets as well as responses to the content of the items
(Cronbach, 1946, 1950; Edwards, 1957). Response sets may be a 
source of (a) error variance—they may be confounded with legiti- 
mate replies to the content of an item or (b) reliable variance— 
they may reflect reliable personality traits or stylistic tendencies 
on the part of the subject (Berg, 1955; Frederiksen and Messick, 
1959; Jackson and Messick, 1958). Research has indicated that 
two response sets of major importance are the acquiescent re- 
sponse set and the social desirability response set (cf. Couch and 
Keniston, 1960; Jackson and Messick, 1960; Wiggins and Rumrill, 
1959). There are two approaches to dealing with response sets: 
(a) reduction of the extent to which they are permitted to operate 
by controlling test item form and (b) assessment of the degree of
their operation after they have occurred. Control and assessment 
of response sets has involved partialling out response sets statis- 
tically, using balanced scoring keys, using the forced-choice item 
format, and differentially scoring for separate set and content 
components (cf. Chapman and Bock, 1958; Edwards, 1957;
Helmstadter, 1957; Messick, 1961; Messick and Frederiksen, 1958;
Webster, 1958). The acquiescent response set appears to be amenable
to the development of separate set and content scores. However, 
the social desirability response set does not appear to be suscepti- 
ble to this approach because item desirability is usually confounded 
with item content. 

Separate acquiescent response set and authoritarian content
scores have been developed for the California F Scale (Adorno, 
Frenkel-Brunswik, Levinson, and Sanford, 1950) by utilizing re- 
versed or negative F Scale items as well as the usual positive items 
contained in the scale. Messick and Frederiksen (1958) developed 
a pair of formulas based on one of several models utilized earlier 
by Helmstadter (1957) for obtaining separate set and content 
scores for ability tests. The formulas are as follows:1 


\[ C = (F_a/N_p) + (U_a/N_n) - 1 \]
\[ S_I = [(F_a/N_p) - (U_a/N_n)] / (1 - C) \]


where F_a is the number of positive items agreed with, U_a is the
number of negative items disagreed with, and N_p and N_n are the
numbers of items keyed in the positive and negative direction,
respectively. Messick (1961) has suggested that the above acquiescent
score formula introduced some distortion over a portion of the
distribution when applied to bipolar attitude scales and introduced
a modified formula changing the algebraic value of C in the
denominator to an absolute value to eliminate this distortion. This
revised formula will be denoted S_II in this paper. Triandis and
Triandis (1962) have suggested still another scoring formula as 
follows: 


\[ S_{III} = (F_a/N_p) + (U'_a/N_n) - 1 \]

where U'_a is the number of unfavorable items agreed with. If an examinee
responds in a perfectly consistent manner to the positive and neg- 
ative items, i.e., independent of the acquiescent response set, then 
the proportion of positive items agreed with plus the proportion of 
negative items agreed with should equal unity. The degree of oc- 
currence of the acquiescent response set is indicated by the extent 
to which the above sum exceeds unity. 
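
As a worked illustration of the scoring formulas above, the following Python sketch computes C and the three acquiescence scores for one examinee. The function and argument names are hypothetical, and the formulas follow the reading of the printed equations adopted in this section. The number of negative items agreed with is taken as N_n minus the number disagreed with, which assumes that no items are omitted.

```python
def response_set_scores(f_a, u_a, n_p, n_n):
    """Return (C, S_I, S_II, S_III) for one examinee.

    f_a: number of positive items agreed with
    u_a: number of negative items disagreed with
    n_p, n_n: number of positively and negatively keyed items
    """
    p_pos = f_a / n_p            # proportion of positive items agreed with
    p_neg_dis = u_a / n_n        # proportion of negative items disagreed with
    p_neg_agr = 1 - p_neg_dis    # assumption: every item is answered agree/disagree

    c = p_pos + p_neg_dis - 1    # authoritarian content score
    diff = p_pos - p_neg_dis

    # S_I divides by (1 - C); S_II uses |C| in the denominator instead.
    s1 = diff / (1 - c) if c != 1 else float("nan")
    s2 = diff / (1 - abs(c)) if abs(c) != 1 else float("nan")

    # S_III: proportion agreeing with positive items plus proportion
    # agreeing with negative items, minus one.
    s3 = p_pos + p_neg_agr - 1
    return c, s1, s2, s3


if __name__ == "__main__":
    print(response_set_scores(12, 8, 16, 16))
```

Under the no-omission assumption S_III reduces to the numerator of S_I, which is consistent with the three scores being closely related. The two denominators differ only when C is negative.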

The purpose of the present research was to investigate (a) the
interrelationships that exist among three different acquiescent
response set scoring formulas and also an authoritarian content
scoring formula, (b) the relationship of each of the acquiescent
response set formulas with an independent measure of the social
desirability response set, and (c) the relationships of each measure of
acquiescence, content, and social desirability with other selected
interest, aptitude, and personality measures.


Jackson (1961). However, they have corrected the 2 to a 1 in mailing
reprints of their article.


Method 


The following measuring instruments were administered to 262 
students enrolled in Introductory Psychology at the University of 
Illinois: California F Scale, Negative California F Scale, Social De- 
sirability Scale (Edwards, 1957), and a Personality Inventory. 
They were asked to give their age and sex. Also their scores on the 
Kuder Preference Record and the School and College Aptitude 
Test (SCAT) were obtained from the University Testing Bureau. 
Both the Negative California F Scale and the Personality Inven- 
tory were developed as research measuring instruments (Pedersen, 
1965). The Negative California F Scale consists of 16 reflected
items from the California F Scale. The Personality Inventory
represents a compilation and adaptation of four of Guilford’s scales
(Guilford, 1940; Guilford and Martin, 1943) based on a study by
Becker (1961). The names and descriptions of the four scales are as
follows: cycloid disposition (emotional instability); rhathymia
(carefree, unconcerned, happy-go-lucky attitude); thinking
introversion (attitude of reflection and introspection); and
cooperativeness (tolerance and cooperation). S_I, S_II, S_III, and C
were all scored using the 16 Negative California F Scale items and the
corresponding 16 positive items in the California F Scale.

Pearson Product Moment Correlations were obtained between 
all of the relevant variables. 


Results and Discussion 


The obtained correlations bearing upon the three objectives of the 
study are reported in Table 1. 

There seems to be little preference for any one of the three scor- 
ing formulas for the acquiescent response set based upon correla- 
tions among themselves, with a separate authoritarian content score, 
with the social desirability response set, and with the selected
interest, aptitude, and personality measures. All three of the
acquiescent scores were highly intercorrelated. The correlation of .82
indicates that S_II is only a slight modification of S_I. The
correlations of S_III with S_I and S_II are both higher than the
correlation between S_I and S_II. The correlation of .00 between S_I
and C results from the fact that they were derived to be statistically
independent. S_III correlates less with C than does S_II, although not
significantly less.


TABLE 1
Intercorrelations among Response Sets and Personality Measures

Variable                                    Variable No.
No.  Variable Name                           1           2           3           4           5
  1  Acquiescence Response Set I (S_I)
  2  Acquiescence Response Set II (S_II)     .82**
  3  Acquiescence Response Set III (S_III)   .90**      [illegible]
  4  Social Desirability                     .04        -.03         .00
  5  Authoritarian Content (C)               .00         .82**       .20**      -.05
  6  Age                                    -.08        -.04        -.08         .10         .04
  7  Sex (0 = male; 1 = female)              .16*        .14         .16*        .04         .01
  8  California F Scale                      .36**       .61**       .56**      -.11         .70**
     Kuder Preference Record
  9  Outdoor                                -.14        [illegible] -.12        -.04         .00
 10  Mechanical                              .12         .08         .09         .11         .08
 11  Computational                           .01         .04         .04         .20**      -.05
 12  Scientific                             -.08        -.09        -.07         .07        -.28**
 13  Persuasive                              .04         .05         .04         .11         .00
 14  Artistic                                .04         .02         .04        -.02        -.01
 15  Literary                               -.10        -.12        -.12        -.20**      -.14
 16  Musical                                -.09        -.06        -.07         .04         .05
 17  Social Service                         -.18*       -.08        -.12        -.07         .16*
 18  Clerical                                .22**      [illegible]  .21**       .06         .00
     School and College Aptitude Test
 19  Linguistic                             [illegible] [illegible] [illegible] [illegible] -.32**
 20  Quantitative                           [illegible] [illegible] [illegible] [illegible] -.08
 21  Total                                  [illegible] [illegible] [illegible] [illegible] [illegible]
     Personality Inventory
 22  Cycloid Disposition                    -.04         .03        [illegible] -.72**       .00
 23  Rhathymia                               .16*        .15*        .17*       [illegible] [illegible]
 24  Thinking Introversion                  -.08        [illegible] [illegible] [illegible] -.08
 25  Cooperativeness                        -.15*       -.16*       -.17*       [illegible]  .10


*p = .05 (two-tailed test).
**p = .01 (two-tailed test).

All three acquiescent measures are virtually uncorrelated with the
Social Desirability Scale, and the patterns of the correlations with
the other measures are almost identical. S_I may be slightly preferred
to S_II and S_III because it is independent of C.

Acquiescent and social desirability response sets have rather dif- 
ferent patterns of relationship with the outside variables employed. 
Females tend to acquiesce more than males in responding to the 
California F Scale. However, there is no relationship between the
social desirability response set and sex. 

The scales of the Kuder Preference Record are virtually inde- 
pendent of the acquiescent response set with the exception of the 
positive correlation with the clerical scale. On the SCAT the 
quantitative scale is negatively correlated with acquiescence. The 
linguistic scale is uncorrelated. And of the Personality Inventory 
scales, rhathymia is positively correlated and cooperativeness is
negatively correlated. Cycloid disposition and thinking introver- 
sion are uncorrelated. 

The Social Desirability Response Set is positively correlated 
with the computational scale and negatively correlated with the 
literary scale in the Kuder. It is positively correlated with rhathy- 
mia and cooperativeness and negatively correlated with cycloid 
disposition and thinking introversion in the Personality Inventory. 
The high negative correlation with cycloid disposition suggests 
that the person who scores highly on a scale measuring neuroticism 
is willing to apply a number of socially undesirable items to himself.

The correlations of the authoritarian content score with the 
other measures assist in the understanding of the authoritarian 
personality. The correlation of .70 between C and the California F 
Scale score is rather low considering that the test is designed to 
measure authoritarianism. The magnitude of the correlations of 
the acquiescent response set scores with the California F Scale 
scores is consistent with other findings which indicate that the 
California F Scale tends to elicit an acquiescent response set (Bass,
1955; Chapman and Campbell, 1957; Jackson and Messick, 1957, 
1958; Jackson, Messick, and Solley, 1957; Leavitt, Hax, and Roche, 
1955). The California F Scale measures both authoritarian content 
and an acquiescent response set. The fact that authoritarian content
and acquiescence are slightly correlated (except for S_I) indicates
that acquiescence may be a part of the authoritarian syndrome.
Authoritarian content and social desirability are uncorrelated.
Authoritarian content is negatively correlated with the
scientific interest scale and positively correlated with the social 
service scale of the Kuder. It is also negatively correlated with 
the linguistic scale of the SCAT. 


THE EFFECT OF FORCING RESPONSE ON THE
SEMANTIC DIFFERENTIAL


E. R. OETTING 
Colorado State University 


The semantic differential is, as presently used, essentially a
forced choice technique. Subjects are required to evaluate all the 
concepts being studied on all of the bipolar pairs of adjectives 
used in the rating scales. A central alternative is usually present 
and the standard directions state that if the person sees the con- 
cept as neutral on the scale or if the scale is completely irrelevant 
or unrelated to the concept a check mark may be placed in the 
middle space. These directions meet the requirements for provid- 
ing a score so that the D statistic can be computed, and are as- 
sumed to allow both a neutral response and the privilege of not 
rating a concept on an adjective pair that the subject feels to be 
inappropriate. 

The meaning of a central or neutral response becomes important
where patterns of responding to the semantic differential are
being studied as a function of factors such as intelligence and
personality. One problem in studying variation in response may be
due to the forced choice nature of the instrument. For example, a 
highly anxious individual may, in a free situation, utilize only some 
aspects of the semantic space available to him. However, when 
forced to respond he may be capable of judgments that he would 
ordinarily avoid and his response may show as great a variability 
as that of the nonanxious person. He may also, under the usual 
directions, develop a generalized response to the concept that he 
carries through from one set of adjective pairs to the next, so that
the response to a pair of adjectives he would ordinarily avoid may 
show variability due to the concept class. The anxious person may 
therefore, in a free situation, utilize less of his semantic space and
respond to fewer dimensions in that space, but may, when forced
to do so, show as great a variability of response as the nonanxious
person. It becomes important, therefore, to study the equivalence
of the avoidance response and the neutral response in the semantic 
differential. This study is a comparison of the semantic differential 
under forced choice and nonforced conditions. 


Method 


The Maudsley Personality Inventory was administered to the 
2000 freshmen enrolling at the University of Alberta, Calgary, 
Canada. Nine groups, based on all combinations of three levels of 
Neuroticism and three levels of Introversion, were selected. Of the 
five males and five females from each group who were asked to 
volunteer, 33 males and 40 females were tested. The S's ranked 
ten concepts, two from each of five categories, on a 7-point scale 
over 12 bipolar adjective pairs. Four bipolar pairs, with factor
loadings that occurred on the same dimension in several previous
studies (Osgood, 1962), were selected from each of 3 dimensions:
evaluative, potency, and activity.

The S's first evaluated all ten concepts on the 12 adjective pairs
under directions that stated if they could not rank the concept on
any particular pair of adjectives, they were to leave it blank. These
materials were then removed. The subject was given another set of
materials and asked to rank each concept on every set of adjectives.
If the S's left no blanks on the previous trial, they were still asked
to repeat the rankings. Order of concepts, order of adjectives and
direction were randomized for each S.


Results 


An analysis of variance of the number of times the subjects left
a blank space showed significant effects beyond the .01 level for
bipolar pairs of adjectives, concepts, and the interaction of
concepts by bipolar pairs. Some concepts were clearly viewed as more
difficult to evaluate than others, and this was probably a function
of the adjective pairs selected. The concept “Time,” for example,
had a large number of blank responses, very few of them occurring
on the fast-slow or the active-passive adjective pairs and many
occurring on the hot-cold, large-small, and strong-weak dimensions.
On the other hand, “Symphony” had half as many blank spaces,
almost none on the good-bad and beautiful-ugly dimensions and
about the average for the “Time” concept on the hot-cold and
hard-soft adjective pairs.

For the particular concepts chosen, the bipolar pairs sharp-dull,
hot-cold, and long-short were left blank about 20 per cent of the
time, whereas the pairs good-bad, beautiful-ugly, active-passive, and
nice-awful were left blank only about 10 per cent of the time.

The concepts could be classified into five different types and the 
adjective pairs into three dimensions. The differences between 
concept types and between adjective pairs are apparently a func- 
tion of the individual concepts and pairs, since there were no sig- 
nificant differences between these broader classifications. 

There were no significant differences between the personality 
groups in the number of blank spaces left on the semantic differ- 
ential. However, an examination of the extreme groups suggested 
that, in males, the group classified as both Neurotic and Introverted 
on the Eysenck scale may have left a greater number of blank 
spaces. All other groups, including those in the middle on the
scales and those with extreme scores on only one scale, showed
essentially the same number of blank responses. This comparison

suggests that further study of extreme groups might be valuable. 
However, no comparable difference occurred in the female group, 
and the result could be attributable to chance. 

The equivalence of a blank space and a neutral response was 
checked by comparing the responses made to items left blank and 
to items marked neutral when the task was repeated. The pattern
of responses was almost identical. For both neutral and blank re- 
sponses, 58 per cent were marked neutral on the repeated task, 30 
per cent were marked one category away, and 12 per cent two or 
three categories away from neutral. 


Discussion 
The results indicate that, when subjects are allowed to leave 
blank spaces on the semantic differential, they will leave blanks 
where they view the bipolar adjective pairs as meaningless in
relation to the concept involved. While some concepts and some
adjective pairs have more blank responses, the interaction between
the concept and the adjective pair appears to be the important 
variable. Examining the results suggests that almost any concept 
could be rated with very few blank spaces if appropriate adjective 
pairs were selected. When the subject views the adjective pair as 
inappropriate, he will tend to leave it blank and when forced to 
make a response it is essentially equivalent to a neutral response. 

The equivalence of the blank response under nonforced condi- 
tions to a neutral response when forced does not mean that the re- 
verse is true. The adjective pair fast-slow is almost never left blank 
for the concept “Time,” but is marked as neutral about twenty 
percent of the time. 

The relationship between personality characteristics and re- 
sponse on the semantic differential, if it exists, is probably a very 
complex one. This study suggests that responses to the semantic 
differential should be considered in terms of the interactions of 
concepts and adjective pairs, and that results may vary greatly de- 
pending on the specific interactions involved. Looking for response 
sets, such as neutral sets or marking of extremes, is not likely to 
produce results unless the particular concepts and adjective pairs 
are meaningfully related to the response set, and to the personality 
characteristic being considered. An individual with certain
personality characteristics might, for example, avoid responding on
the adjective pair hot-cold to the concept “Mother,” but might not
show an avoidance response to other concepts and adjective pairs. 

The forced choice technique is apparently valid for the semantic
differential, in that forced responses tend to have the same
pattern as neutral responses. Allowing the subject to leave blank
spaces may, however, provide an index of the meaningfulness of
the particular adjective pairs to the concepts involved. In future
studies of response sets on the semantic differential the relevance of
the concept-adjective interaction to the response set should be
considered.


In view of the tremendous advances that have been made in the
adaptation of electronic computers and accounting machines to the 
processing of statistical data, sections of the Spring and Autumn is- 
sues of EDUCATIONAL AND PSYCHOLOGICAL MEASURE- 
MENT are devoted to the publication of such programs as are ap- 
propriate to psychometric procedures. Programs relevant to such 
problem areas as factor analysis, item analysis, multiple regression 
procedures, the estimation of the reliability and validity of tests, 
pattern and profile analysis, the analysis of variance and covari- 
ance, discriminant analysis, and test scoring will be considered. 
Customarily a program should be expected not to exceed six or 
eight printed pages. Manuscripts of four or fewer printed pages are 
preferred. Each manuscript will be carefully reviewed as to its 
suitability and accuracy of content. In some instances an accepted 
paper may be returned to the author for possible revisions or 
shortening. The cost to the author will be twenty-five dollars per 
page for regular running text. The extra cost of the composition of 
tables and formulas will be added to the basic rate. 


Manuscripts received up to November first will be considered 
for the Spring issue; manuscripts received between then and May 
first will be considered for the Autumn issue. 


All correspondence and duplicate manuscripts should be directed 
to: 


Dr. William B. Michael 

Professor of Education and Psychology 
School of Education 

University of Southern California 

Los Angeles, California 90007 


A PREDICTOR-VARIABLE SELECTION PROGRAM


DOROTHY T. THAYER AND MELVIN R. NOVICK¹
Educational Testing Service 


This package contains programs to perform the following standard
types of regression analyses: full regression, forward selection,
and backward elimination. In addition, the program provides for
determining the optimum relative test lengths so as to maximize 
the multiple correlation for a fixed total testing time using the 
algorithm developed by Woodbury and Novick (1967). A listing 
of the program is available from the author upon written request. 


Program 


The control program is used to select the individual phases the 
user requires. The user specifies the phases desired on a parameter
card with the proper data input, and the main program controls 
the selection of the phases. Each phase can be used separately, or 
the whole package can be used. The F4STAT subroutines which 
are on library tape at Educational Testing Service are used in each 
phase. A description of the statistical operators which comprise 
the elements of the F4STAT system is given by Beaton (1964). A 
complete treatment of the effects on reliability and validity of 
altering test lengths, as is done in phase 5, is given by Lord and 
Novick (1968). 


1 Research reported herein was supported in part by the Logistics and Mathe- 
matical Statistics Branch of the Office of Naval Research under contract Nonr- 
4866(00), NR 042-249. Reproduction, translation, publication, use and disposal 
in whole or in part by or for the United States Government is permitted. 

Phase 1—Preliminary


From raw score input the following sample quantities are calculated
and printed: (a) cross-products matrix, (b) covariance
matrix, (c) mean vector, and (d) standard deviation vector.
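
The four Phase 1 quantities can be sketched in a few lines of Python. This is an illustration only, not the F4STAT code: the function name is hypothetical, and the covariance divisor of n - 1 is an assumption, since the description does not say whether n or n - 1 is used.

```python
from math import sqrt


def preliminary_stats(rows):
    """Compute Phase 1 style summaries from raw scores.

    rows: list of equal-length score lists, one list per person.
    Returns (cross_products, covariance, means, standard_deviations).
    """
    n = len(rows)
    p = len(rows[0])

    # (c) mean vector
    means = [sum(r[j] for r in rows) / n for j in range(p)]

    # (a) raw cross-products matrix: sum over persons of x_j * x_k
    cross = [[sum(r[j] * r[k] for r in rows) for k in range(p)]
             for j in range(p)]

    # (b) covariance matrix; divisor n - 1 is an assumption of this sketch
    cov = [[(cross[j][k] - n * means[j] * means[k]) / (n - 1)
            for k in range(p)] for j in range(p)]

    # (d) standard deviation vector from the covariance diagonal
    sds = [sqrt(cov[j][j]) for j in range(p)]
    return cross, cov, means, sds


if __name__ == "__main__":
    print(preliminary_stats([[1, 2], [2, 4], [3, 6]]))
```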


Phase 2—Full Regression 


A least squares regression equation is fitted to the data using all
the predictors. The regression coefficients and intercept of the linear
model are estimated using the standard least-squares method.

The following sample quantities are printed: (a) correlation
matrix, (b) standard deviation vector, (c) mean vector, (d) inverse
of the correlation matrix, (e) product of correlation matrix and its
inverse, (f) list of n predictors used in the regression equation, (g)
sample squared multiple correlation, R_n^2, and multiple correlation,
R_n, (h) estimated residual variance of criterion given the n
predictors, (i) estimated original criterion variance, (j) F statistic
for evaluating the increment in the sample squared multiple
correlation using n predictors as compared with using zero predictors,
(k) intercept and regression coefficients, and (l) partial correlations.
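
A minimal Python sketch of a full least-squares fit may help to fix ideas about items (g), (h), (j), and (k). It is not the program's own code: the function names are hypothetical, the normal equations are solved by plain Gaussian elimination, and the F statistic uses the usual form (R²/n) / ((1 - R²)/(N - n - 1)) for n predictors and N persons, which is assumed to be what the program reports.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x


def full_regression(xs, y):
    """Fit y on all predictors; xs holds one row of predictor scores per person.

    Returns (intercept, coefficients, R squared, residual variance, F).
    """
    n_obs, n_pred = len(xs), len(xs[0])
    design = [[1.0] + list(row) for row in xs]      # leading 1 gives the intercept
    k = n_pred + 1
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(k)] for i in range(k)]
    xty = [sum(d[i] * yi for d, yi in zip(design, y)) for i in range(k)]
    beta = solve(xtx, xty)

    y_bar = sum(y) / n_obs
    ss_tot = sum((yi - y_bar) ** 2 for yi in y)
    ss_res = sum((yi - sum(b * v for b, v in zip(beta, d))) ** 2
                 for d, yi in zip(design, y))
    r2 = 1 - ss_res / ss_tot
    df_res = n_obs - n_pred - 1
    f_stat = (r2 / n_pred) / ((1 - r2) / df_res)    # n predictors versus none
    return beta[0], beta[1:], r2, ss_res / df_res, f_stat


if __name__ == "__main__":
    print(full_regression([[1], [2], [3], [4]], [1, 3, 2, 4]))
```

The sketch does not guard against a perfect fit (R² = 1) or a singular cross-products matrix; a production routine would need both checks.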


Phase 3—Forward Selection Procedure


all the predictors have been included,


Before the forward selection procedure is started the correlation
matrix and mean and standard deviation vectors are printed. Also,


At each step in the forward selection procedure, the following are
computed and printed: (a) predictors included in equation, (b)
sample multiple correlation using the included predictors, (c)
estimated residual variance of criterion given the included
predictors, (d) estimated original criterion variance, (e) intercept and
regression coefficients, (f) partial correlations, (g) F statistic for
evaluating the increment in the sample squared multiple correlation
using the i selected predictors in this stage as compared with
employing the i - 1 selected predictors in the previous stage, and (h)
F statistic for evaluating the increment in the sample squared
multiple correlation using all n predictors as compared with using the
i selected predictors in this stage.
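
The stepwise logic can be sketched as follows. The selection rule used here, adding at each step the predictor that gives the largest sample squared multiple correlation, is an assumption made by analogy with the backward elimination phase; the function names are hypothetical, and the two F ratios take the standard forms for an increment in R².

```python
def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x


def r_squared(xs, y, cols):
    """Squared multiple correlation of y on the predictors indexed by cols."""
    if not cols:
        return 0.0
    design = [[1.0] + [row[c] for c in cols] for row in xs]
    k = len(cols) + 1
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(k)] for i in range(k)]
    xty = [sum(d[i] * yi for d, yi in zip(design, y)) for i in range(k)]
    beta = solve(xtx, xty)
    y_bar = sum(y) / len(y)
    ss_tot = sum((yi - y_bar) ** 2 for yi in y)
    ss_res = sum((yi - sum(b * v for b, v in zip(beta, d))) ** 2
                 for d, yi in zip(design, y))
    return 1 - ss_res / ss_tot


def forward_selection(xs, y):
    """Add predictors one at a time until all are included."""
    n_obs, n_pred = len(xs), len(xs[0])
    r2_all = r_squared(xs, y, list(range(n_pred)))
    chosen, r2_prev, steps = [], 0.0, []
    while len(chosen) < n_pred:
        rest = [c for c in range(n_pred) if c not in chosen]
        # assumed rule: take the candidate giving the largest R squared
        best = max(rest, key=lambda c: r_squared(xs, y, chosen + [c]))
        chosen.append(best)
        i = len(chosen)
        r2 = r_squared(xs, y, chosen)
        # F for i predictors versus the i - 1 of the previous stage
        f_step = (r2 - r2_prev) / ((1 - r2) / (n_obs - i - 1))
        # F for all n predictors versus the i selected so far
        f_all = None
        if i < n_pred:
            f_all = ((r2_all - r2) / (n_pred - i)) / ((1 - r2_all) / (n_obs - n_pred - 1))
        steps.append({"included": list(chosen), "r2": r2,
                      "f_step": f_step, "f_all": f_all})
        r2_prev = r2
    return steps
```

Each step refits the regression from scratch, which is wasteful but keeps the sketch short; sweep-type operators would update the solution in place instead.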


Phase 4—Backward Elimination Procedure 


A backward elimination procedure is used to select predictor var- 
iables. A least squares regression equation is fitted to the data us- 
ing all the predictors. This is done by using phase 2. The predictor 
that will give the smallest decrease in the sample multiple correla- 
tion is eliminated first, and this procedure is continued step by step; 
at each step, the predictor that will give the smallest decrease in the 
sample multiple correlation is eliminated. The process is continued 
until there is only one predictor left. 

At each step in the backward elimination procedure the following 
are determined and printed: (a) list of predictors eliminated, (b) 
list of predictors remaining in the regression equation, (c) sample 
multiple correlation using remaining predictors, (d) estimated re- 
sidual variance of criterion given the included predictors, (e) esti- 
mated original criterion variance, (f) intercept and regression coeffi- 
cients, (g) partial correlations, (h) F statistic for evaluating the 
increment in the sample squared multiple correlation using the i 
predictors in the previous stage as compared with using the i — 1 
remaining predictors in this stage, and (i) F statistic for evaluating 
the increment in the sample squared multiple correlation using all n
predictors as compared with using the i remaining predictors in this
stage.
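
The elimination rule described above can be illustrated directly from a correlation matrix, using the identity R² = r'R⁻¹r for the predictors retained. The following Python sketch is an assumption-laden illustration, not the program itself: the function names are hypothetical, and the F ratio for dropping from i to i - 1 predictors takes the standard form.

```python
def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x


def r2_from_correlations(r_xx, r_xy, cols):
    """R squared = r' R^-1 r for the predictors indexed by cols."""
    sub = [[r_xx[i][j] for j in cols] for i in cols]
    rhs = [r_xy[i] for i in cols]
    weights = solve(sub, rhs)            # standardized regression weights
    return sum(w * r for w, r in zip(weights, rhs))


def backward_elimination(r_xx, r_xy, n_obs):
    """Drop predictors one at a time until a single predictor is left.

    r_xx: predictor intercorrelation matrix
    r_xy: predictor-criterion correlations (validities)
    n_obs: number of persons, needed only for the F ratios
    """
    remaining = list(range(len(r_xy)))
    eliminated, steps = [], []
    r2_prev = r2_from_correlations(r_xx, r_xy, remaining)
    while len(remaining) > 1:
        i = len(remaining)
        # eliminate the predictor whose removal leaves the largest R squared,
        # i.e. gives the smallest decrease in the multiple correlation
        drop = max(remaining, key=lambda c: r2_from_correlations(
            r_xx, r_xy, [k for k in remaining if k != c]))
        remaining = [k for k in remaining if k != drop]
        eliminated.append(drop)
        r2 = r2_from_correlations(r_xx, r_xy, remaining)
        # F for the i predictors of the previous stage versus the i - 1 left
        f_step = (r2_prev - r2) / ((1 - r2_prev) / (n_obs - i - 1))
        steps.append((list(eliminated), list(remaining), r2, f_step))
        r2_prev = r2
    return steps
```

With three mutually uncorrelated predictors whose validities are .6, .3, and .1, the sketch drops the third predictor first and the second next, leaving the predictor with validity .6.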


Phase 5—Optimizing Test Lengths for Maximum Battery Validity


The algorithm developed by Woodbury and Novick (1967) is
used to determine the optimum length for predictors of a battery
given a fixed total testing time so as to maximize the multiple
correlation with a fixed criterion.

Given a fixed total testing time a solution is attempted using the
matrix formulas given by Woodbury and Novick. If these formulas
yield non-negative time allocations for each predictor variable, then
this is a valid solution. This time allocation, the correlation matrix,
means, standard deviations, reliabilities, validities, and regression
weights of the altered predictors at their optimum lengths are
calculated and printed. The squared multiple correlation and multiple
correlation for the battery at the altered lengths are determined and
printed.
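
The effect of altering test lengths on reliability and validity can be illustrated with the classical lengthening formulas (Spearman-Brown for reliability, and the corresponding expressions for validity and for the correlation between two lengthened tests). The Python sketch below is not the Woodbury and Novick matrix algorithm: it handles only two tests, assumes each has unit length to begin with, and replaces the closed-form solution with a brute-force grid search over ways of dividing the total time. All function names are hypothetical.

```python
from math import sqrt


def altered(rel, val, k):
    """Reliability and validity after changing test length by the factor k."""
    denom = 1 + (k - 1) * rel
    return k * rel / denom, val * sqrt(k / denom)


def r2_two_tests(rels, vals, r12, ks):
    """Squared multiple correlation of two tests at length factors ks."""
    d1 = 1 + (ks[0] - 1) * rels[0]
    d2 = 1 + (ks[1] - 1) * rels[1]
    v1 = vals[0] * sqrt(ks[0] / d1)              # validity of altered test 1
    v2 = vals[1] * sqrt(ks[1] / d2)              # validity of altered test 2
    c = r12 * sqrt(ks[0] * ks[1] / (d1 * d2))    # altered intercorrelation
    return (v1 * v1 + v2 * v2 - 2 * v1 * v2 * c) / (1 - c * c)


def best_split(rels, vals, r12, total, steps=200):
    """Grid search for the division of total time that maximizes R squared."""
    best = None
    for s in range(1, steps):
        k1 = total * s / steps
        ks = (k1, total - k1)
        r2 = r2_two_tests(rels, vals, r12, ks)
        if best is None or r2 > best[1]:
            best = (ks, r2)
    return best


if __name__ == "__main__":
    print(altered(0.5, 0.3, 2))
    print(best_split((0.6, 0.7), (0.4, 0.5), 0.3, 2.0))
```

A grid search scales poorly beyond a few tests, which is one reason a closed-form matrix solution is attractive for a full battery.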

If the original total testing time yields an invalid solution indi- 
cating negative individual testing times, the smallest value of the 
total testing time above which all individual testing times are posi- 
tive is determined and the valid solution for that time is given. The 


necessary variables and a valid solution for the originally specified 
time is obtained and given.

The program then determines the time points at which each of 
the variables is eliminated from the optimal solution as the total 
available testing time decreases.

The following are determined and printed at each stage: (a) test- 
ing time available, (b) list of predictors eliminated, (c) lengths of 
the remaining predictors, (d) correlation matrix, validities, means,
standard deviations, and reliabilities for the altered predictors, (e) 


Beaton, A. E. The Use of Special Matrix Operators in Statistical
Calculus. Princeton, New Jersey: Educational Testing Service, 1964.
Lord, F. M. and Novick, M. R. Statistical Theories of Mental Test
Scores (with contributions by A. Birnbaum). Reading, Massachusetts:
Addison-Wesley, 1968.
Woodbury, M. A. and Novick, M. R. Maximizing the Validity of a
Test Battery as a Function of Relative Test Lengths for a Fixed Total
Testing Time. Princeton, New Jersey: Educational Testing Service, 1967.


A FORTRAN COMPUTER PROGRAM FOR A
MODERATED STEPWISE PREDICTION SYSTEM


DONALD A. ROCK, JOHN L. BARONE, AND ROBERT L. LINN
Educational Testing Service 


The specific objective of the task reported here was to develop a
computer-based statistical system which is general in nature yet
could yield some tentative answers to questions concerning
interactions between groups of individuals and their attributes when
predicting some criterion of success. Thus, given a sample of
individuals with multiple measures on each individual including a
criterion of success, one can utilize the present system to isolate
groups which consist of individuals characterized by common profiles
on background or personality variables, and which also yield optimal
within group prediction systems.

This program provides two systematic procedures for grouping 
individuals on similar background profiles for purposes of develop- 
ing within group prediction equations. Unlike previous techniques 
of introducing moderators, the power in this technique lies in the 
fact that specific moderators may be handled simultaneously or in a 
stepwise manner. That is, the program will iteratively select that 
subset of moderators which will yield groupings which have the 
largest within group multiple correlations. Thus the computerized 
procedure operates on two separate systems of variables
simultaneously. That is, in addition to criterion data, two different
basic data matrices are required: an N × P predictor matrix X and an
N × M moderator matrix V, where N (N ≤ 550) is the total number of
individuals for which we have complete information on P (P ≤ 10)
predictors and one criterion and also M (M ≤ 5) hypothesized
moderating variables.
The program uses an iterative procedure in an attempt to maximize
two objective functions, one of which is associated with the
predictor matrix X and is referred to as the predictive objective
function, and the second is called a grouping objective function or
error sum of squares and is associated with the moderator data matrix.
The grouping objective function yields an indication of the
similarity of profiles among individuals within any one group or groups
formed on the M moderator variables or some subset of the M
moderators. The predictive objective function yields an overall
indication for the predictive accuracy of the system for each unique
set of groupings on the moderator variables.

The computer program also has two methods available for maxi- 
mizing within group multiple correlations as defined by the predic- 
tive objective function. The first and theoretically less optimal 
model assumes common covariance matrices, but differing inter- 
cepts for the groups. In general, this approach should yield lower 
within group multiple correlations on the primary sample but will
also be less susceptible to sampling error and thus shrinkage on the
validation sample. The second method is more optimal in the sense
that it makes no assumption concerning a common covariance matrix
and thus computes separate regression equations for each subgroup.


Grouping on Profiles 


The distance measure for the similarity of two profiles may be de- 
fined as follows: 


hierarchical groups of mutually exclusive group memberships
beginning at the jth stage, j depending on one of two user options:
(1) j can be a parameter submitted by the user, or (2) j is that stage
where at least one of the hierarchically formed groups achieves a
minimum size N_j which also can be stipulated by the user. If the
remaining hierarchically formed groups have fewer than N_j members,
they are pooled into one group for prediction purposes.

The grouping objective function may be defined as the sum of
squared deviations taken about the group mean which in turn is
taken over all groups. Therefore, in the special case in which the
grouping objective function is applied to a D² matrix, and given any
two people i and i', the following equality holds:


\[ \sum_{g=1}^{G} \frac{1}{2N_g} \sum_{i=1}^{N_g} \sum_{i'=1}^{N_g} D^2_{ii'} = \sum_{m=1}^{M} \sum_{g=1}^{G} \sum_{i=1}^{N_g} (V_{mig} - \bar{V}_{mg})^2 \qquad (2) \]


where:


i, i' = person (i = 1, 2, ..., N_g)
g = group (g = 1, 2, ..., G)
m = moderator variable (m = 1, 2, ..., M)
V_{mig} = observation of the mth moderator variable for the ith
person in the gth group, and
\bar{V}_{mg} = mean of the mth moderator variable in the gth group.

Thus, given a total of G sets or groups the computer program insures
the reduction to G - 1 sets while minimizing
the error sum of squares as defined in expression 2. Within
any one stage, the grouping objective function is computed inde- 
pendently of the predictive objective function. However, the moder- 
ator variable or combination of moderator variables which yields 
those groups having the highest predictive objective functions will 
be retained for the next stepwise level. For example, if the re- 
searcher has M moderators, M grouping procedures will be carried 
out at the first level. Each of these M grouping procedures deals 
with one D² matrix, i.e., there is one D² matrix associated with each
of the M moderators. The one moderator variable which yields the 
highest predictive objective function will then be retained for level 
two. At level two, M — 1 D? matrices will then be computed based 
on the moderator variable from level 1 taken in combination with 
the remaining moderators one at a time. The "best" (as defined by 
the predictive objective function) combination of two will then be 
carried into level three. This stepwise procedure continues in this 
manner until all M levels are exhausted or if the increase in the pre- 
dictive objective function does not exceed some predetermined in- 

crement when going from one level to the next. 
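
The link between the error sum of squares and a matrix of squared distances can be checked numerically. The following is a minimal sketch in Python, not the authors' program: the function names and the three illustrative profiles are assumptions made here, and the distance is assumed to be the squared Euclidean distance between moderator profiles.

```python
def squared_distance(u, v):
    """Squared Euclidean distance between two moderator profiles."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def error_ss_from_scores(profiles):
    """Sum of squared deviations about the group mean, over all moderators."""
    n = len(profiles)
    n_vars = len(profiles[0])
    means = [sum(p[m] for p in profiles) / n for m in range(n_vars)]
    return sum((p[m] - means[m]) ** 2 for p in profiles for m in range(n_vars))


def error_ss_from_d2(profiles):
    """The same quantity computed only from the squared distances
    between all ordered pairs of group members."""
    n = len(profiles)
    total = sum(squared_distance(p, q) for p in profiles for q in profiles)
    return total / (2 * n)


# Hypothetical group of three people measured on two moderators.
group = [[1.0, 4.0], [2.0, 6.0], [4.0, 5.0]]
print(error_ss_from_scores(group), error_ss_from_d2(group))
```

Both functions return the same value for any one group, which is why a grouping step can work from the D² matrix alone without returning to the raw scores.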
Within Class Regressions


As the groups are defined based on the error sum of squares ob- 
jective function, the researcher has two options for computing his 
predictive objective function. The first option makes no assumption 
concerning homogeneity of the dispersions and means for the G 
groups at any one stage. That is, at any one stage in the grouping 
(G groups, G-1... G-n) stepwise multiple regressions for the matrix 
of predictor variables, X, will be computed within each group. 

Whether or not an additional predictor variable is added to the 
equation within any one group is a function of a predetermined in- 
crement parameter which is under the control of the user. 

The predictive objective function at stage s is defined as

    T_s = \sum_{g=1}^{G} \left[ R_{gs}^2 \sum_{i=1}^{N_g} Y_{ig}^2 \right]    (3)

where

T_s = predictive objective function at stage s

R²_gs = squared multiple correlation between the predictors and the
criterion within the gth group at stage s

Y_ig = observation on the criterion for individual i in the gth group

G = number of groups which were formed at stage s

N_g = number of observations in group g at stage s


The "s and their multiple correlations need not be computed and 
printed at every stage in the grouping but may be computed only 
when either: (a) the number of observations within the largest group 
at that particular stage attains a predetermined minimum value or 
(b) when the user stipulates a predetermined maximum number of 
groups. 


Intercept Method 


This option is based on the assumption that the groupings on
profiles based on moderator variables will lead to groups having
common covariance matrices but differing intercepts. Thus, at each
stage in the grouping the computer program will carry out the
following computations:

    A' = \bar{Y}' - b'\bar{X}    (4)

where

A' is a 1 × G vector of intercepts; each element a_g is the intercept
for group g
\bar{Y}' is a 1 × G vector of means for each group on the criterion
b' is a 1 × P vector of beta weights which are computed on the
whole sample
\bar{X} is a P × G matrix whose columns are the respective predictor
means for each group.


There is only one multiple correlation at any one stage and it is
computed as follows:

    R_s = r\left[ Y_{ig},\ \left( a_g + \sum_{p=1}^{P} b_p X_{pig} \right) \right]    (5)

where

i = 1, 2, ..., N_g; where N_g is the number of people in group g
g = 1, 2, ..., G; where G is the number of groups at stage s
p = 1, 2, ..., P; where P is the number of predictors


The predictive objective function is now defined as:

    T_s = R_s^2 \sum_{i=1}^{N_t} Y_i^2    (6)

where

N_t = total number of individuals in the whole sample.


At each stage the computer output includes group membership, 
common stepwise multiple correlation, and regression output in- 
cluding vectors of predictor and moderator means for each group. 
The predictive objective function as well as the grouping objective 
function is also printed out at each stage. 
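
The intercept computations can be illustrated with a small Python sketch. It is an illustration under stated assumptions, not the program described here: it uses a single predictor (P = 1), an ordinary raw-score regression weight computed on the whole sample in place of the beta weights, and hypothetical function names and data.

```python
from math import sqrt
from statistics import mean


def pearson_r(x, y):
    """Product-moment correlation between two equal-length lists."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def intercept_method(groups):
    """groups: list of (predictor_values, criterion_values), one pair per group.

    Returns the common weight, the group intercepts, the single multiple
    correlation, and the predictive objective function.
    """
    all_x = [x for xs, _ in groups for x in xs]
    all_y = [y for _, ys in groups for y in ys]
    mx, my = mean(all_x), mean(all_y)
    # Common weight computed on the whole sample (assumption: raw-score weight).
    b = (sum((x - mx) * (y - my) for x, y in zip(all_x, all_y))
         / sum((x - mx) ** 2 for x in all_x))
    # One intercept per group: group criterion mean minus b times predictor mean.
    intercepts = [mean(ys) - b * mean(xs) for xs, ys in groups]
    predicted = [a + b * x for a, (xs, _) in zip(intercepts, groups) for x in xs]
    r = pearson_r(all_y, predicted)
    t = r * r * sum(y * y for y in all_y)
    return b, intercepts, r, t


# Two hypothetical groups with the same slope but different intercepts.
groups = [([1.0, 2.0, 3.0], [2.0, 3.0, 4.0]),
          ([1.0, 2.0, 3.0], [5.0, 6.0, 7.0])]
print(intercept_method(groups))
```

In this contrived example the two groups differ only in level, so the group intercepts absorb the whole difference and the single correlation reaches 1.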

Program listings and writeups may be obtained from the senior
author upon request.
A MISSING OBSERVATIONS
CORRELATION PROGRAM¹


PAUL A. GAMES 
Ohio University 


In many factor analysis or multiple regression studies it is
necessary to administer a large battery of tests to a relatively
small number of subjects in each administration. When many subjects
are desired, this results in many administrations, and a substantial
likelihood that occasionally one or more tests will be improperly
administered for one group of subjects, resulting in “missing
observations” in the matrix of scores. If the groups of subjects may
be conceived of as random samples from the same population, the best
way to provide the complete m by m matrix of correlations between all
tests is to use a “missing observations” correlation computation
procedure. Computer programs have been written that compute the
correlation between test I and test J by using only the data of
subjects who have scores for both tests (Dixon, 1964; Dick, 1964) and
that apply the conventional correlation formula to these data. The
major problem with such programs is that they must keep separate
records for each pair of tests on each of the following variables:
n_ij = the number of subjects for whom both observations
simultaneously exist, and the sums over these n_ij subjects of X_i,
X_j, X_iX_j, X_i², and X_j².

Since there are m(m − 1)/2 combinations of pairs of tests in a
battery of m tests, n_ij and each of these five sums must be
dimensioned as m(m − 1)/2. The resulting consumption of computer
storage limits the maximum number of tests that may be used in such
programs.

1 This project was partially supported by HEW Grant MH-10174-01.
Computer time was furnished by the Ohio University Computer Center.

An alternative little-known procedure devised by Irwin and Ostermann
(1965) permits increasing the maximum number of tests in a given size
computer by approximately 22 per cent over the conventional solution,
in addition to providing a more convenient output of the best
estimate of the mean and standard deviation of each variable, based
upon all available measures on that variable, i.
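The conventional solution is:

    r'_{ij} = \frac{1}{n_{ij}} \sum_{k=1}^{n_{ij}} z'_{ik} z'_{jk}
            = \frac{1}{n_{ij}} \sum_{k=1}^{n_{ij}}
              \frac{(X_{ik} - \bar{X}'_i)(X_{jk} - \bar{X}'_j)}{S'_i S'_j}
            = \frac{\frac{1}{n_{ij}} \sum_{k=1}^{n_{ij}} X_{ik} X_{jk} - \bar{X}'_i \bar{X}'_j}{S'_i S'_j}
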

where the primes on the means and standard deviations of each
variable indicate these values are those obtained on the n_ij cases of
this particular combination. The z scores and r are symbolized with a
prime to indicate they are based upon those means and standard
deviations. Since the number of joint observations on test I and
test J is necessarily less than or equal to the number of observations
on either variable alone (n_i ≥ n_ij and n_j ≥ n_ij), the means and
standard deviations used in the above r's are often based upon
fewer cases than are available to estimate these values. Instead of
using an estimate based upon a smaller data set than that available,
one may use the mean (\bar{X}_i) and the standard deviation (S_i) based
upon all available measures (n_i) when defining the z scores of the
correlation coefficient (r_ij). For this case, r is defined as:

    r_{ij} = \frac{1}{n_{ij}} \sum_{k=1}^{n_{ij}} z_{ik} z_{jk}
           = \frac{1}{n_{ij}} \sum_{k=1}^{n_{ij}}
             \frac{(X_{ik} - \bar{X}_i)(X_{jk} - \bar{X}_j)}{S_i S_j}

and it is possible to arrive at the following equivalent formula:

    r_{ij} = \frac{\frac{1}{n_{ij}} \sum_{k=1}^{n_{ij}} X_{ik} X_{jk} - \bar{X}'_i \bar{X}'_j
             + [(\bar{X}'_i - \bar{X}_i)(\bar{X}'_j - \bar{X}_j)]}{S_i S_j}

Note that the standard deviations used in the denominator are not
the same as those used in the r' equation, since these are based upon
n_i and n_j cases, and not just the n_ij joint observations. Also note
that the numerator of r' is the same as the numerator of r except for
the addition of the square bracketed term on the right of the latter.
This term involves the discrepancy between the means of the restricted
n_ij set of data and the possibly more complete n_i and n_j sets of
data. If n_ij is equal to n_i this term will be zero, and any
discrepancy between r' and r will be due to the difference between
S'_j and S_j. If n_ij = n_i = n_j then r' and r are identical;
otherwise r would be preferred over r' in that r is based upon more
stable estimates of the population means and standard deviations.

The r formula also has the virtue of requiring only the storage of m
values of \sum_{k=1}^{n_i} X_{ik}^2, rather than the m(m − 1) values of
\sum_{k=1}^{n_{ij}} X_{ik}^2 and \sum_{k=1}^{n_{ij}} X_{jk}^2 required by
the r' formula. This is the reason why the maximum value of m for a
given amount of computer storage may be approximately 22 per cent
larger when using r. A FORTRAN IV computer program and accompanying
write-up using the missing observation r solution are available from
the author.

It should be noted that neither solution is appropriate for
situations in which the elimination of subjects' scores is on a
systematic rather than a random basis, e.g., when the variables are
scores obtained in successive grades of high school. Since dropping
out of school is not a random process, the correlation between two
tests given on entering school and the correlation between two tests
given in the senior year are on different populations of subjects.
Although a matrix of correlations could be obtained by either
procedure, interpreting either matrix would be a most complex,
tenuous, and dubious business.

One disturbing property of this solution is that since the z scores in
the r definition may be a subset of a larger set of z scores, it does
not automatically follow that the maximum absolute value of
\sum_{k=1}^{n_{ij}} z_{ik} z_{jk} is n_ij. When
|\sum_{k=1}^{n_{ij}} z_{ik} z_{jk}| > n_ij, the obtained absolute value
of r will exceed 1.0. This outcome is feasible as a result of:

1. Sampling fluctuation when the population correlation is
exceedingly high and the proportion of existing joint observations is
low compared to the number available separately in each variable
(e.g., ρ = .95, n_ij = 100, n_i = n_j = 200).

2. Violation of the random sampling condition initially specified.
If the means and/or standard deviations based on all available cases
differ systematically from those based on only the joint
observations, no missing observations formula is appropriate.

If absolute values of r greater than one are obtained, and the user
is positive that 2 is not the case, he should turn to the usual r'
solution. The absence of such values, of course, does not constitute
evidence that random sampling was used. The author has found computed
values of r and r' rarely to differ by more than .02 when working
with a large matrix of 181 subjects and a minimum n_ij value of 155.
COMPUTER PROCEDURES FOR DETERMINING
SIGNIFICANCE WITH RANGE TESTS OVER
DIFFERENCES BETWEEN MEANS


STEVEN G. GOLDSTEIN, JAMES D. LINDEN,
AND LAURENCE N. HARRIS
Purdue University 


In the context of the analysis of variance (ANOV), the advent of
a significant F value presents many experimenters with a distinct
problem in procedure. The educational and psychological literature
is replete with cases of multiple t-tests violating all assumptions of
independence underlying this statistic. Winer (1962) has shown
that as the number of t-tests increases, the collective level of
significance also increases as 1 − (1 − α)^n, where α is the level of
significance and n is the number of comparisons made.
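
The growth of the collective level of significance is easy to tabulate. The short Python sketch below simply evaluates the expression 1 − (1 − α)^n; the function name is an assumption made here, and the expression itself treats the n comparisons as independent.

```python
def collective_significance_level(alpha, n_comparisons):
    """Probability of at least one Type I error in n independent tests,
    each run at level alpha: 1 - (1 - alpha)**n."""
    return 1.0 - (1.0 - alpha) ** n_comparisons


for n in (1, 3, 10):
    print(n, round(collective_significance_level(0.05, n), 4))
```

At α = .05, ten comparisons already carry a collective level of about .40.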

Just as the experimenter may appear reckless in his use of
multiple t-tests, he may appear over-conservative should he choose the
Scheffé (1953) test, where the probability of a Type I error
(rejecting the null hypothesis when, in fact, it is true) is at most α
for any possible comparison. Multiple t-tests tend to yield the
largest number of significant differences while the Scheffé test will
provide the smallest number.

Between these two extremes are two other procedures which offer
much to the experimenter. These are the Newman-Keuls procedure
and the Tukey (a) procedure (Winer, 1962). Both of these tests are
range statistics; i.e., their use involves the consideration of all
possible differences over the range of differences when treatment
means have been ordered. For an ordered pair, according to the
Newman-Keuls procedure α is defined equal to a level of significance
set by the experimenter; but if all tests of ordered pairs are
regarded as a single test, the level of significance is lower than α.
The Tukey (a) procedure has a level of significance which is never
lower than α. The value for significance for all tests, no matter how
far apart the ordered means may be, is the highest value. Thus this
procedure is a more conservative one than the Newman-Keuls procedure
and, over the same data, generally will yield fewer significant
differences. However, the power (rejecting the null hypothesis when it
is, in truth, false) of the Tukey (a) procedure is less than that of
the Newman-Keuls test.

The purpose of this paper is to present a program designed to as- 
sess significant differences between ordered means through the use 
of either the Newman-Keuls or Tukey (a) procedures. The choice 
of technique is left to the user. The program was written in the IBM 
7090/7094 FORTRAN IV language that is processed by the 
FORTRAN IV (IBFTC) compiler, a component of the IBM 7090/
7094 IBJOB processor which is a subset of the IBSYS operating 
system (version 13). 


Procedure 


Following the program deck is the control card, which is set up as
follows:


Columns   Punch

1,2       Number of mean values (k < 15).
11,12     The Newman-Keuls procedure is used if NK is punched in
          these columns. The Tukey (a) test is performed if TA is
          punched.
18-22     The value for Mean Square Within, in F 10.4 format.
23-26     The sample size to be used to divide the mean square
          value, in F 4.0 format.
30-32     The value for degrees of freedom (df) associated with
          the mean square: the program will only recognize whole
          integer values in single steps from 1 to 14 and 16, 18, 20,
          24, 30, 40, 60 and 120. Should the user desire df = ∞,
          the punch in these columns should be 999.
          Any alphanumeric information to be used as a title.

Following the control card are the data card(s), which carry the means
for each variable, one mean following the next on the same card in
F 8.4 format. The total number of data cards for any one job cannot
exceed two.

The program assigns numbers from 1 to k to the means and prints 
them in ordinal position. A matrix of differences is then printed 
along with indications of significance (* p < .05; ** p < .01). The 
name of the test chosen by the user is printed above the matrix. Be- 
low the matrix, the truncated range is printed horizontally together 
with the values for the distribution of the Studentized Range Sta- 
tistic at the .05 and .01 levels of significance for the df chosen. Below 
this are the significance values for the Newman-Keuls test. Should 
the Tukey (a) test have been used, the highest value on the line of 
significant values for the Newman-Keuls test (positioned at the far
right) is the value used for significance. 

Any number of jobs can be run as a train with each new job being 
signaled by the advent of a new control card. 
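
The comparison of ordered means can be sketched as follows. This is a simplified Python illustration, not the FORTRAN IV program: the function name is hypothetical, the critical values of the Studentized Range Statistic must be supplied by the user from a table, and the sequential stopping rule of the full Newman-Keuls procedure is omitted. The q values in the example are those commonly tabled for df = 20 at the .05 level and should be verified against a table before any real use.

```python
from math import sqrt


def range_tests(means, ms_within, n, q_table, method="NK"):
    """Compare all pairs of ordered means.

    q_table maps the number of ordered steps spanned (2..k) to a critical
    value of the Studentized Range Statistic. With method "NK" the critical
    value depends on the steps spanned; with "TA" the value for k steps is
    used for every comparison.
    Returns tuples (low mean number, high mean number, difference,
    critical difference, significant).
    """
    k = len(means)
    order = sorted(range(k), key=lambda idx: means[idx])
    ordered = [means[idx] for idx in order]
    standard_error = sqrt(ms_within / n)
    results = []
    for i in range(k):
        for j in range(i + 1, k):
            steps = j - i + 1
            q = q_table[steps] if method == "NK" else q_table[k]
            difference = ordered[j] - ordered[i]
            critical = q * standard_error
            results.append((order[i] + 1, order[j] + 1, difference,
                            critical, difference > critical))
    return results


# Hypothetical job: three means, Mean Square Within = 4, n = 4 per mean.
q_05 = {2: 2.95, 3: 3.58}
for row in range_tests([10.0, 14.0, 11.0], 4.0, 4, q_05, method="NK"):
    print(row)
```

In this example the difference between the second and third ordered means is significant under the Newman-Keuls option but not under the Tukey (a) option, which illustrates the more conservative character of the latter.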
PROGRAMS FOR THE DETERMINATION OF THE
COEFFICIENT AND STANDARD ERROR OF THE 
TETRACHORIC CORRELATION 


STEVEN G. GOLDSTEIN, JAMES D. LINDEN, 
AND THOMAS T. BAKER 


Purdue University 


* 

In an earlier paper (Goldstein, Linden, and Studebaker, 1966), a
rationale and procedure for the use and estimation of the tetrachoric
coefficient of correlation (r_t) were presented. The procedure was
developed from the tables of Davidoff and Goheen (1953), having
begun with the two-by-two contingency table


Keyed Direction of Item X

The ratio ad/bc was found, the Davidoff and Goheen tables were
entered, and an estimate of r_t was obtained. When p = p' = .50, the
estimate was found to be well within the standard error (σ) of r_t.


However, the point has been made (Hayes, 1943) that as p or p'
or both deviate from .50 the estimated value for r_t increasingly
becomes spurious and no efficient estimate can be made of σ.

This paper presents a rationale and procedure for computer
programs which yield relatively exact values for r_t and σ.


Rationale 


Pearson (1900) showed that the solution for r_t was actually an
infinite exponential expansion. Early work by the principal author
using a Burroughs E101 computer showed that the quadratic
expression gave a close approximation of r_t with residuals of the
order 1 × 10⁻⁵. This quadratic expansion may be represented as


    \frac{ad - bc}{N^2 y y'} - \left[ r_t + \frac{z z'}{2} r_t^2 \right] = 0    (1)

where a, b, c, d, N and r_t are defined above and y is the height of
the ordinate at a point z units from the mean of the distribution in
question. Once the solution of this expansion has been achieved, the
residual may be assessed as

    Residual = \frac{ad - bc}{N^2 y y'} - \left[ \frac{z z'}{2} r_t^2 + r_t \right]


Guilford and Lyons (1942) proposed an estimation of σ in the
form


which, for ease of discussion, can be summarized as A · B · C · √(D · E).
Hayes (1943) took issue with this procedure, arguing that the use of
this approximation provided spurious values as p and p' diverged
[...] was defensible. He was especially [...] use of the estimated
value given by the [...] when r_t = 0, √(D · E) must equal unity.
In presenting results [...] in r_t derived from both the use of
[...] formulation, Hayes [...]
The programs discussed below were written with the express
purpose of testing the null hypothesis ρ = 0. The assumption of a
two-tailed test is made, and the program output indicates values for
the .05, .01, .005 and .001 levels of significance.

The program was written in the IBM 7090/7094 FORTRAN IV 
language that is processed by the FORTRAN IV (IBFTC) com- 
piler, a component of the IBM 7090/7094 IBJOB processor which, 
in turn, is a subset of the IBSYS operating system (version 13). 


Tetrachoric Correlation Program 


The control card for the r_t program, which immediately follows
the program deck, is read first. The setup for this card is as follows:


Columns   Punch

1         Punch A.
2         If input data is from cards for group A, punch C. If it is
          from tape, punch T.
3-10      If C was punched in column 2, columns 3 and 4 are used
          to indicate the number of cards that are to be summed to
          make group A. If T was punched in column 2, columns
          3-10 are used to indicate the group numbers that form
          group A and which appear in column 1 of each data card.
          These numbers may be any single column digit from 1 to 9.
11        Punch B.
12-20     Identical with columns 2-10 in reference to group B.
21-26     Any alphanumeric identification information desired on the
          punched output.
31-80     Alphanumeric title which will appear on the top of the
          first page of the print out.

Data cards follow the control card. They are so arranged that all
cards to be summed for inclusion in group A for the first question
come first and are followed by all cards for group B for the first
question. This procedure is followed for the remaining items. If the
contingency table cell subtotals are on tape, the only restriction is
that they are in numerical order regardless of group membership.
That is, all data for item 1 must be together and must be followed
by all data for item 2 in the same group order as the data for item 1.
The program has a fixed input format and input data cards must
be punched as follows:


Columns   Punch

1         Group number.
2-4       Item number.
10-13     Totals (or subtotals) keyed for cells b or d of the
          contingency table.
20-23     Totals (or subtotals) keyed for cells a or c of the
          contingency table.

The output for the program first prints the item number and then
the total values for a, b, c, d, and N. On the following line, values
are given for p, p', r_t and the residual. The punched output contains
values for the item number, N, p, p', r_t, the residual and
alphanumeric identification from columns 21-26 of the control card in
the format I3, I5, 4E15.8, 6X, A6.

In the solution of any quadratic, the occurrence of a pair of
imaginary roots is always a possibility. Should this happen, the
program prints D I S A S T E R and no card is punched.
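
A quadratic solution of this kind can be sketched in Python. The sketch rests on assumptions that the text does not confirm: that the quadratic is the two-term truncation of Pearson's series, (ad − bc)/(N² y y') = r_t + (z z'/2) r_t², that a and d are the concordant cells, and that p and p' are the marginal proportions (a + b)/N and (a + c)/N. The function name is hypothetical, and all four marginal totals must be nonzero.

```python
from math import pi, sqrt
from statistics import NormalDist

_STD_NORMAL = NormalDist()


def tetrachoric_quadratic(a, b, c, d):
    """Return (r_t, residual) from the quadratic approximation,
    or None when the quadratic has a pair of imaginary roots."""
    n = a + b + c + d
    p = (a + b) / n          # marginal proportion on the first item
    p_prime = (a + c) / n    # marginal proportion on the second item
    z = _STD_NORMAL.inv_cdf(p)
    z_prime = _STD_NORMAL.inv_cdf(p_prime)
    y = _STD_NORMAL.pdf(z)              # ordinate at z
    y_prime = _STD_NORMAL.pdf(z_prime)  # ordinate at z'
    target = (a * d - b * c) / (n * n * y * y_prime)
    half_zz = z * z_prime / 2.0
    if abs(half_zz) < 1e-12:
        # The squared term vanishes, so the equation is linear in r_t.
        r_t = target
    else:
        discriminant = 1.0 + 4.0 * half_zz * target
        if discriminant < 0.0:
            return None
        # Root that tends to the linear solution as z z' tends to zero.
        r_t = (-1.0 + sqrt(discriminant)) / (2.0 * half_zz)
    residual = target - (r_t + half_zz * r_t * r_t)
    return r_t, residual


print(tetrachoric_quadratic(30, 20, 20, 30))
print(tetrachoric_quadratic(80, 10, 10, 0))  # imaginary roots under these assumptions
```

When p = p' = .50 both deviates are zero and the estimate reduces to 2π(ad − bc)/N². A None result corresponds to the case in which the program would print its imaginary-root message.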

This program is written so that it may run many jobs as a train,
recycling each time a new control card is encountered.


Significance Program


[...] p and p', the standard error when ρ = 0, [...] asterisks as
follows: * p < .05, ** p < .01, [...]


Each time the program encounters a different alphanumeric
punch in columns 75-80, it skips to the top of a new page, prints the
new identification and continues.
7094 TEMPLATE PROGRAMS FOR SCORING
DICHOTOMOUS RESPONSE FORMAT TESTS WITH
SPECIAL REFERENCE TO THE MINNESOTA
MULTIPHASIC PERSONALITY INVENTORY


STEVEN G. GOLDSTEIN, JAMES D. LINDEN,
AND THOMAS T. BAKER
Purdue University 

Savings in time and effort have stimulated the development of
many procedures for the electronic processing of tests and
questionnaires (Shuman, 1966). From the earliest mark sensing devices
which simply counted responses, scoring programs have appeared
with increasing regularity. That many programs have been specific
to a given test or one user’s system has made the employment of the
procedure by others difficult, if not impossible. Users of popular
tests such as the Strong Vocational Interest Blank (SVIB) or the
Minnesota Multiphasic Personality Inventory (MMPI) are well
aware of the problems (e.g., cost and time loss) encountered by the
necessity of sending answer sheets to outside agencies to be scored.

For some time, the authors have been concerned about the prob- 
lems encountered with the scoring of dichotomous-response format 
tests such as the MMPI. These concerns finally prompted the writ- 
ing of the programs described below. While these programs were 
written to score, norm, and code the MMPI (Dahlstrom and Welsh, 
1960), they are generalizable to any situation where a dichotomous 
response format is employed. 

The scoring program recognizes only 0 and 1 punches. However,
the assignment of positive and negative valences to these punches is
left to the discretion of the user. The second program uses the card
output of the scoring program as its input. Depending upon the
selected option, the program will provide standard (T) score output
for raw score input data. When used with MMPI data, if desired,
standard scores may be represented in Welsh Code.

The programs were written in the IBM 7090/7094 FORTRAN IV 
language that is processed by the FORTRAN IV (IBFTC) compiler, 
a component of the IBM 7090/7094 IBJOB processor which, in turn, 
is a subset of the IBSYS operating system (version 13). 


The Scoring Program 


The scoring templates must follow the program deck immediately.
A separate template set is required for each scale to be scored.
Preceding each template set is the control card for the template
in the form:


Columns   Punch

1-3       The alphanumeric symbols which will be printed above
          the score for this scale.
5         The K fraction to be added: the program recognizes only
          the symbols 0, 2, 4, 5, and 1, where 0 indicates no K
          fraction addition; 2, 4, and 5 are .2, .4, and .5 of K; and
          1 indicates the addition of the entire value for K.
7-9       The number of the last test item on this particular scale.


Following the template control card are the template cards
themselves. Columns 1-9 are free for any identification that the user
may desire. Columns 10-80 are punched in the format 71A1 and are
keyed in the desired direction for score. This procedure may be
repeated for as many as 25 template sets.


Following the final template is the main control card, which is in
the following form:


Columns   Punch

11-13     The total number of items on the entire test.
15-17     The number of unanswered items that the user is willing
          to allow and still have scores produced for a given subject.
          The tape unit number for input. (If the data originally
          were on cards and loaded on a system analogous to the one
          described above, the punch would be 5 for logical unit 5.)
          Output tape unit (logical 6).
23        If printout of scores without the K fraction added is
          desired, punch an asterisk (*) in this column. Otherwise,
          leave the column blank.
24        If K fraction added scores are desired, punch *. Otherwise,
          leave the column blank.
25        If the user desires punched output of scores without the K
          fraction, punch *. Otherwise, leave the column blank.
26        An * punch gives K fraction added punched output.
          Otherwise, leave the column blank.
28        An * punch in this column indicates that a title card
          follows which will be printed on the top of each page.
          Otherwise, leave the column blank.
29        The * in this column indicates a title card which will be
          printed three lines from the top of each page. This title
          card can be used in combination with the card mentioned
          for column 28 or it may be used alone.
31-36     Alphanumeric identification which will appear on punched
          output in columns 11-16.
80        An * must be punched in this column.


Following the control card are the title card(s) and then the data
cards. The specifications of the program require that the data cards
have a seven column wide identification number in columns 1-7, a
card sequencing digit in column 8, and a blank in column 9.
Data in 71A1 format then follow on the card. A 9999999 punch in
columns 1-7 follows the last data card and acts as an end-of-file
statement.

The printed output provides for ten subjects on a page. The scores 
are listed horizontally with the scale symbol. Should the number of 
unanswered questions (no punch in a given column) for a subject 
exceed the figure punched in columns 15-17 of the control card, the 
printout will so note, and no score will be printed. Incorrect or im- 
properly ordered data sets also will be noted. The number of unan- 
swered questions for each subject always will be printed under the 
symbol “Q” and be the first punched score on the output cards. 
Punched output from this program is in the following form:

Columns   Punch

1-7       Subject number.
8         A 0 is punched if the option of no K fraction added was
          selected. A 1 is punched otherwise.
11-16     Alphanumeric identification from columns 31-36 of the
          control card.
20-70     Scores for up to 25 scales in I2 format.
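
The template scoring logic can be sketched in a few lines of Python. This is an illustration only, not the FORTRAN IV program: the function names are hypothetical, a template column that is not 0 or 1 is assumed to mean the item is not keyed on that scale, and the K fraction is assumed to be rounded to the nearest whole number (halves upward) so that scores fit the I2 format.

```python
# Control card symbols for the K fraction, as listed in the template control card.
K_FRACTIONS = {"0": 0.0, "2": 0.2, "4": 0.4, "5": 0.5, "1": 1.0}


def score_scale(responses, template):
    """Count items whose 0/1 punch matches the keyed direction."""
    return sum(1 for r, t in zip(responses, template)
               if t in ("0", "1") and r == t)


def count_unanswered(responses):
    """Number of columns with neither a 0 nor a 1 punch (the Q score)."""
    return sum(1 for r in responses if r not in ("0", "1"))


def add_k_fraction(raw_score, k_raw_score, symbol):
    """Add the selected fraction of the K raw score (rounding is an assumption)."""
    return raw_score + int(K_FRACTIONS[symbol] * k_raw_score + 0.5)


responses = "1 01"   # hypothetical four-item answer record, one blank
template = "1101"    # hypothetical keyed directions for one scale
print(score_scale(responses, template), count_unanswered(responses))
print(add_k_fraction(10, 15, "5"))
```

A subject whose count of unanswered items exceeds the limit punched in columns 15-17 of the main control card would simply be skipped before any scale is scored.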


T Score Program 


This program was written specifically for the punched output of
the scoring program described above. However, with certain
restrictions, any data may be processed as noted below.

Following the program deck is the control card which is set up as 
follows: 


Columns   Punch

1,2       The number of variables (k < 25).
11,12     Two asterisks (**) indicate that the user desires the
          Welsh code option on the MMPI long form. One * in
          column 12 indicates Welsh code on short form.
21-80     Any alphanumeric title to be printed on the top of each
          page.

Directly following the control cards are cards carrying the scale
symbol to be printed above the score and the mean and standard
deviation for the scale. (The programs reported here do not com[...]

A variable type format card follows the scale cards. The format
card is followed by the data cards, whose format must begin I7, 13X,
[...]. Should the Welsh code option have been selected, then the
for[...] where y is some two digit number. Generally, if the long form
of the MMPI was used, y > 14.

[...] scale symbols, the raw score and T scores to the nearest whole
number. If the Welsh code option was selected, the ordered scale
numbers with associated symbols (see Dahlstrom and Welsh, 1960,
pp. 18-22) are printed below the T scores. The end-of-file statement
is the same for this program as it is for the scoring program, i.e.,
9999999 in columns 1-7 of the last card.
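
The conversion from raw scores to T scores can be sketched as follows. The text does not print the formula, so this Python sketch assumes the standard definition T = 50 + 10(raw − mean)/SD, with the scale mean and standard deviation taken from the scale cards and halves rounded upward; the function name is hypothetical.

```python
from math import floor


def t_score(raw_score, scale_mean, scale_sd):
    """Standard (T) score to the nearest whole number, assuming
    T = 50 + 10 * (raw - mean) / sd."""
    return floor(50.0 + 10.0 * (raw_score - scale_mean) / scale_sd + 0.5)


# Hypothetical scale with mean 15 and standard deviation 5.
print(t_score(20, 15.0, 5.0))
```

A raw score one standard deviation above the scale mean therefore yields a T score of 60.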
A PROGRAM FOR SCORING SEVERAL TYPES OF
PROBABILISTIC TESTS 


ROBERT M. RIPPEY 


Center for the Cooperative Study of Instruction 
The University of Chicago 


Probabilistic tests are tests which require students to assign
probabilities or weights of preference to each response. Such tests
may contain items having unique correct responses, but they do not
need to be made up entirely of such items. Rationales for such
scoring procedures have been developed elsewhere (Baker, 1965;
Massengill and Shuford, 1965; Roby, 1965; Shuford, 1965; Toda, 1963).
These rationales are based on several considerations: (a) Providing 
the student with a fair game for maximizing his score without re- 
sorting to guessing, (b) Increasing the reliability and validity of in- 
struments and (c) Allowing the test constructor to ask questions 
which do not have unique correct answers—questions which require 
a statement of differential preference responses. These tests, how- 
ever, are difficult to score, since they require the computation of 
scores which are, in most instances, functions of all the probabili- 
ties assigned to the responses for a particular item. 

In order to conduct further studies into the properties of these
tests, the author has written a computer program for scoring prob-
abilistic tests. Three scoring functions are available at the option of
the user. For tests containing items which do not all have unique 
correct responses, the Euclidean scoring function is available (Rip- 
pey, 1967). For scoring tests with single unique correct answers, the 
simpler spherical and truncated logarithmic functions are available 
(Shuford, Albert, and Massengill, 1966). These programs are writ- 
ten in Fortran for a computer with 40K memory, such as the IBM 
1620. Programs are available from the author at the University of 
Chicago, Chicago, Illinois, 60637. 
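
For items with a single correct answer, the two simpler scoring functions can be sketched in Python. The sketch is not the author's Fortran program, and the exact forms are assumptions based on the usual statements of these rules: the spherical score is taken as the probability on the correct response divided by the square root of the sum of squared probabilities, and the truncated logarithmic score as 1 + log10 of the probability on the correct response, with probabilities below a floor of .01 treated as .01. Function names and the floor parameter are hypothetical, and at least one weight per item must be nonzero.

```python
from math import log10, sqrt


def normalize(weights):
    """Convert integer weights (0 to 9) into probabilities summing to one."""
    total = sum(weights)
    return [w / total for w in weights]


def spherical_score(weights, correct):
    """Probability on the correct response over the length of the vector."""
    p = normalize(weights)
    return p[correct] / sqrt(sum(x * x for x in p))


def truncated_log_score(weights, correct, floor=0.01):
    """1 + log10(p_correct), with p_correct truncated below at the floor."""
    p = normalize(weights)
    return 1.0 + log10(max(p[correct], floor))


# Hypothetical four-option item whose correct response is the first option.
print(spherical_score([1, 1, 1, 1], 0))
print(truncated_log_score([9, 1, 0, 0], 0))
```

Under both functions a student who marks a single 1 on the correct response receives the maximum score of 1, which matches the single-answer convention described for the criterion response cards.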



736 EDUCATIONAL AND PSYCHOLOGICAL MEASUREMENT 
The input requirements for each phase of the program are as
follows:


INPUT REQUIREMENTS
PROBABILISTIC TEST SCORING PROGRAM

PHASE 1: Determination of key from criterion responses. (For
Euclidean functions only)

Card 1

Column 1:     Input format: 1 = Normal, 2 = Porta Punch (KF)
Column 2-3:   Number of items (less than or equal to 72) (KI)
Column 4:     Number of responses per item (less than or equal
              to 6) (KN)
Column 5-6:   Number of cards per subject (KK)
Column 7-8:   Number of items per card (KIC)
Column 9-48:  Any 40 character title

Criterion response cards follow. Each card has the following format:

Column 1-5:   Identification number of the subject. Each subject
              must have KK cards, each containing the same ID
              number.
Column 6:     Card sequence number 1 to 9 depending on the number
              of cards per subject.
Column 7-78:  Criterion group responses for each item response (an
              integer from 0 to 9). These integers may be considered
              as probabilities, or they may be weights of preference.
              If a subject elects to choose a single answer from the
              response options, he should mark a 1 on that response
              and nothing for the other response options.

Last Card

Column 1-2 = -1.


PHASE 2: Scoring subject or criterion responses. Computation of
standard deviation of item scores.

Card 1

Column 1:     Standard deviation option. 1 = no, 2 = yes.
Column 2-3:   Scoring function (LS). 1 = Euclidean, 2 = Spherical,
              3 = Truncated logarithmic.

Card 2 is automatically punched out if Phase 1 is used. Otherwise,
a card having the same data as card 1, Phase 1, must be prepared.
The subsequent card arrangement depends on the scoring function
used.


1. Euclidean 


Following card 2 are the key cards which are the punched output of 
phase 1. Then follow the data cards which are in the same format as 
the criterion cards for phase 1. 

The last card in the deck is a —1 in columns 1 and 2. 


2. Spherical or Logarithmic 


Card 3: Key card. This card contains the number of the correct
response punched into the column corresponding to the number of the
item.

Next come the data cards for the subjects. These cards are punched
according to the same format as the criterion response cards.

The last card has a —1 punched into columns 1 and 2. 


PHASE 3: Subtest scoring reliability and validity computations 
and subtest intercorrelations. 


To use phase 3, some work must be done on the cards. All of the 
cards containing total scores must be sorted out of the pass 2 output. 
This is done by sorting out all 0 punches in column 1. If this is done, 
all other cards will fall into the reject pocket of the sorter in the 
proper order for subsequent analysis. If validity is to be computed, a 
single card is placed after each subject’s item score cards. This card 
contains the identification number in columns 1-6, and the criterion 
score in columns 7-12. This constitutes the new packet of input data
cards for phase 3.


Card 1 
The first card is removed from the punched output deck from
phase 2. This is the last card punched in the phase 2 output. On this
card, punch the following data:

Column 1: The number of subtests (less than 9).
Column 10-41: A new title.
Column 56: 1 if validity is to be computed, otherwise, blank.

Card 2 Subtest Key


Punch each item’s subtest number into the column whose number
corresponds to the number of the item. For example, if item seven is on
subtest three, punch a 3 in column 7.
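
The subtest key rule can be sketched in a few lines of Python. This is only an illustration: the function name is hypothetical, an 80-column card is assumed, and subtest numbers are limited to 1 through 8 because the text requires fewer than 9 subtests.

```python
def build_subtest_key(item_to_subtest: dict, n_items: int) -> str:
    """Return the Card 2 image: column i carries the subtest number of item i."""
    card = [" "] * 80                      # assumed 80-column card
    for item, subtest in item_to_subtest.items():
        if not 1 <= item <= n_items:
            raise ValueError(f"item {item} is outside 1..{n_items}")
        if not 1 <= subtest <= 8:
            raise ValueError("subtest numbers must be between 1 and 8")
        card[item - 1] = str(subtest)      # column numbers are 1-based
    return "".join(card).rstrip()
```

With `build_subtest_key({1: 1, 7: 3}, n_items=10)`, the seventh column of the returned card holds a 3, matching the example in the text.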

Then follows the data card packet. 

No end card is necessary for this phase of the program.

In the phase three output, the highest numbered variable represents
the total score; the next highest variable represents the
subject’s score on the even numbered items. The variable preceding
that represents the scores on odd numbered items. The variable
before that represents the criterion variable in the case that validity is
called for. Otherwise, it represents the last subtest.


Samples of answer sheets, inputs, and outputs are available from 
the author. 
A DESCRIPTION OF SEVERAL TEST REPORTING
INNOVATIONS AS PROGRAMMED FOR THE 
1130 COMPUTER 


HARRY J. CLAWAR 
Educational Records Bureau 


Pupil-Reference Group and Intra-Individual Comparisons
The first section is a description of a combination pupil, class,
norm-population histogram and table of significant differences.
These two treatments of test data permit the teacher or counselor
to make three types of observations for each pupil.
1. pupil vs. class median comparisons.
2. pupil vs. norm population median comparisons.
3. intra-individual comparisons (differences among all possible
pairs of subtests for a battery).

The following describes how the above-mentioned comparisons
are made.


I. Histogram
A. There are three vertical bars of X's for each subtest.
1. pupil bar—the height equals pupil's score. 
2. class bar—the height equals class median. 
3. norm bar—the height equals norm median. 


B. There are indicators of the importance of pupil vs. class and 
pupil vs. norm-population differences. 
1. The letter D printed at the top of the class bar indicates 
that the pupil differs significantly from the class median. 
2. The letter D printed at the top of the norm bar indicates
that the pupil differs significantly from the norm population
median.
3. When the space at the top of either the class or norm bar
is blank it indicates that, in terms of the error of measurement
for the test, the pupil score and that group
median cannot be considered to represent different
levels of performance.


C. Necessary Data and Computer Operations

1. A one dimensional array of standard error by subtest is
entered into the computer.

2. For testing pupil vs. class median differences (pupil is a
class member and entered into median) the following
operations occur:

(a) for each subtest the standard error is multiplied
by $\sqrt{(N-1)/N}$, where N equals the number of
pupils in the class,

(b) the result of ‘a’ is multiplied by 1.80 (15 per cent
level of significance multiplier, considering the lesser
sampling stability of the median than the arithmetic
mean).

(c) the pupil-class difference must equal or exceed the
result of ‘b’ in order for the D to appear at the top
of the class median bar.

3. For testing pupil vs. norm population medians the following
operations occur:

(a) a norm median table is supplied to the program.

(b) for each subtest the standard error is multiplied by
1.80.

(c) pupil norm-group difference must equal or exceed
the result of ‘b’ in order for the D to appear at the
top of the norm-population median bar.
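
The two threshold rules can be sketched in Python as follows. This is a hedged illustration, not the 1130 program itself. The function names are hypothetical, the class correction is read as the square root of (N - 1)/N, and the difference is compared in absolute value, which the text does not state explicitly.

```python
import math

MEDIAN_MULTIPLIER = 1.80   # 15 per cent level multiplier quoted in the text


def class_threshold(standard_error: float, n_pupils: int) -> float:
    """Steps (a) and (b) for the pupil vs. class median comparison."""
    return MEDIAN_MULTIPLIER * standard_error * math.sqrt((n_pupils - 1) / n_pupils)


def norm_threshold(standard_error: float) -> float:
    """Step (b) for the pupil vs. norm population median comparison."""
    return MEDIAN_MULTIPLIER * standard_error


def d_flag(pupil_score: float, median: float, threshold: float) -> str:
    """Step (c): 'D' when the difference equals or exceeds the threshold."""
    return "D" if abs(pupil_score - median) >= threshold else " "
```

For a subtest with a standard error of 3.0 in a class of 25 pupils, the class threshold is about 5.29 score points and the norm threshold is 5.40.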


II. Table of Significant Differences

A. An N X N dimensional table is established where N equals
the number of subtests and each cell entry is the standard
error of the difference between a score on each of the two
tests multiplied by 1.44 (15 per cent level of significance
multiplier).

B. Differences between all pairs of subtests are compared
against the entries in the above table.

C. For differences between pupil scores, equal to or greater
than the tabled values, an asterisk is printed in the Table
of Significant Differences in the cell indicating the
intersection of the two tests.
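
A minimal Python sketch of the table follows. The text does not say how the standard error of a difference is obtained, so the sketch assumes independent errors and uses the square root of the sum of the two squared standard errors. The function names are hypothetical.

```python
import math


def difference_table(standard_errors, multiplier=1.44):
    """N x N table of critical differences between pairs of subtests."""
    n = len(standard_errors)
    return [
        [
            0.0 if i == j
            # assumed: SE of a difference = sqrt(se_i^2 + se_j^2)
            else multiplier * math.hypot(standard_errors[i], standard_errors[j])
            for j in range(n)
        ]
        for i in range(n)
    ]


def flag_differences(scores, table):
    """Asterisk where a pupil's two subtest scores differ by at least the tabled value."""
    n = len(scores)
    return [
        ["*" if i != j and abs(scores[i] - scores[j]) >= table[i][j] else " "
         for j in range(n)]
        for i in range(n)
    ]
```

With standard errors of 3.0 and 4.0 the critical difference is 1.44 x 5.0 = 7.2, so scores of 50 and 58 are flagged while scores of 50 and 55 are not.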


Norms of Class Achievement 


Flanagan (1951) has pointed to the fact that assigning percentile 
ranks, based on a distribution of pupil scores, to grade or class me- 
dians leaves much to be desired. A distribution of group averages 
tends to display much less variability than a distribution of indi- 
vidual scores. In addition there appears to be no constant relation- 
ship between these two types of distribution. 

Therefore, percentile ranks for distributions of classes in the norm 
group (at present only for an independent school population), in 
order to accurately reflect class achievement, are looked-up by the 
computer and printed opposite each class median. 
A PROGRAM FOR THE ANALYSIS OF CERTAIN
TIME-SERIES QUASI-EXPERIMENTS¹


THOMAS O. MAGUIRE² AND GENE V GLASS³
University of Illinois 


In a recent article, Box and Tiao (1965) developed a method of
evaluating the change in level between two successive points in time
of a nonstationary time-series. Observations $z_t$ are taken at
equally spaced time intervals and one wishes to make inferences
about a possible shift in level of the time-series associated with the
occurrence of an event at a particular point in time. This is
precisely the situation described by Campbell and Stanley (1963) as a
time-series quasi-experimental design. Several observations are
taken before and after the administration of a treatment, T, e.g.,
$O_1\ O_2\ O_3\ T\ O_4\ O_5\ O_6$. If there is an abrupt shift in the level of a time-series
between the third and fourth observations, evidence of a treatment
effect may exist. Campbell and Stanley recognized the shortcomings
in the statistical tests they suggested as possible analytic
techniques (Campbell, 1963; Campbell and Stanley, 1963). The model
and statistical techniques developed by Box and Tiao (1965)
appear to be the most suitable methods now available which might
have application to the analysis of time-series quasi-experiments.

The statistical model underlying the Box-Tiao analysis of change 
in level of a time-series is the integrated moving average model. 


$$z_1 = L + a_1 \quad \text{and} \quad z_t = L + \gamma \sum_{i=1}^{t-1} a_i + a_t \qquad (3.1\mathrm{a})^{4}$$


1 The work reported in this paper was supported by a grant from the U. S.
Office of Education.

2 Now at the University of Alberta.

3 Now at the Laboratory for Educational Research, University of Colorado.

4 Equation numbers are the same as those in Box and Tiao (1965).
for the $n_1$ observations prior to the introduction of T, and

$$z_t = L + \delta + \gamma \sum_{i=1}^{t-1} a_i + a_t \qquad (3.1\mathrm{b})$$

for the $n_2 = N - n_1$ observations following T, where:

$z_t$ is the value of the variable observed at time t,
L is a fixed but unknown location parameter,
$\gamma$ is a parameter descriptive of the degree of interdependence of
the observations in the time-series and takes values $0 < \gamma < 2$,
$a_t$ is a random normal deviate with mean 0 and variance $\sigma^2$,
$\delta$ is the change in level of the time-series caused by T.

Essentially the model implies that the system is subjected to
periodic random shocks ($a_t$), a proportion ($\gamma$) of which are absorbed
into the level of the series. Data which conform to the model in
(3.1a) evidence the following properties (among others):


1. The graph of the time-series follows an erratic, somewhat random
path with slight, but no systematic drifts, trends, or
cycles.

2. The correlogram (i.e., the graph of the autocorrelations) of
the observations, $z_t$, does not “die out” (i.e., does not tend
systematically toward zero as the lag between values correlated
increases) nor does it show cycles characteristic of cyclic
time-series.

3. For the N - 1 differences between adjacent observations,
$z_t - z_{t-1}$, the lag 1 correlation equals $(\gamma - 1)/[1 + (\gamma - 1)^2]$ and
all higher lag correlations equal zero.
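
Property 3 can be checked numerically. The Python sketch below evaluates the expected lag 1 correlation and computes a sample autocorrelation with the usual mean-corrected estimator. That estimator is an assumption, since the text does not say which one the program uses, and the function names are hypothetical.

```python
def expected_lag1(gamma: float) -> float:
    """Lag 1 correlation of adjacent differences implied by the model."""
    return (gamma - 1.0) / (1.0 + (gamma - 1.0) ** 2)


def differences(series):
    """Differences between adjacent observations, z_t - z_{t-1}."""
    return [later - earlier for earlier, later in zip(series, series[1:])]


def autocorrelation(values, lag: int) -> float:
    """Sample autocorrelation at the given lag (mean-corrected estimator)."""
    n = len(values)
    mean = sum(values) / n
    denominator = sum((v - mean) ** 2 for v in values)
    numerator = sum(
        (values[t] - mean) * (values[t + lag] - mean) for t in range(n - lag)
    )
    return numerator / denominator
```

`expected_lag1(0.0)` is -0.50 and `expected_lag1(0.25)` rounds to -0.48. A correlogram of the differences is then `[autocorrelation(differences(z), k) for k in range(1, 11)]` for a series `z`.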


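By setting $y_1 = z_1$ and $y_t = z_t - \gamma \sum_{j=1}^{t-1} (1-\gamma)^{j-1} z_{t-j}$, the
model can be written as $Y = X\theta + e$ where X is defined as an
N X 2 matrix of weights as follows:

$$X^T = \begin{bmatrix} 1 & (1-\gamma) & \cdots & (1-\gamma)^{n_1-1} & (1-\gamma)^{n_1} & \cdots & (1-\gamma)^{N-1} \\ 0 & 0 & \cdots & 0 & 1 & \cdots & (1-\gamma)^{n_2-1} \end{bmatrix} \qquad (3.3)$$

$\theta$ is a 2 X 1 vector containing as elements L and $\delta$, and e is an N X 1
vector of random normal deviates, $e^T = (a_1, \ldots, a_N)$, the elements
of which have mean 0 and variance $\sigma^2$.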
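
The transformation and the weight matrix can be sketched in Python as below. This is an illustration under the reading of the model given above, not a transcription of the FORTRAN II source, and the function names are hypothetical.

```python
def transform(z, gamma):
    """Return y with y_1 = z_1 and y_t = z_t - gamma * sum (1-gamma)^(j-1) z_(t-j)."""
    theta = 1.0 - gamma
    y = []
    for t in range(len(z)):               # zero-based: z[t] holds z_{t+1}
        smoothed = sum(theta ** (j - 1) * z[t - j] for j in range(1, t + 1))
        y.append(z[t] - gamma * smoothed)
    return y


def weight_matrix(gamma, n1, n2):
    """Rows of X: weights on L (first column) and on delta (second column)."""
    theta = 1.0 - gamma
    rows = []
    for t in range(n1 + n2):
        on_level = theta ** t
        on_shift = theta ** (t - n1) if t >= n1 else 0.0
        rows.append([on_level, on_shift])
    return rows
```

For a noise-free series such as `[10, 10, 10, 8, 8]` with `n1 = 3`, `n2 = 2`, and `gamma = 0.5`, each transformed value equals 10 times the first weight plus -2 times the second, as the linear form requires.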
When $\gamma$ is known, simple least squares estimates of L and $\delta$ can be
found from the familiar solution to the least-squares normal equations:

$$\hat{\theta} = \begin{bmatrix} \hat{L} \\ \hat{\delta} \end{bmatrix} = (X^T X)^{-1} X^T y \qquad (3.4)$$

Box and Tiao showed that the least squares estimate of $\delta$, namely,
$\hat{\delta}$, has a t distribution with N - 2 df when divided by an appropriate
estimate of its standard error.
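
For a fixed value of the interdependence parameter, the estimates can be obtained by solving the 2 x 2 normal equations directly. The Python sketch below does this for a vector `y` and a list of two-element rows `x`. The names are hypothetical, and the standard error is taken from the usual least squares result, the residual variance times the corresponding diagonal element of the inverse cross-product matrix.

```python
import math


def least_squares(y, x):
    """Return (L_hat, delta_hat, s2, se_delta) for the model y = X theta + e."""
    n = len(y)
    s11 = sum(row[0] * row[0] for row in x)
    s12 = sum(row[0] * row[1] for row in x)
    s22 = sum(row[1] * row[1] for row in x)
    b1 = sum(row[0] * value for row, value in zip(x, y))
    b2 = sum(row[1] * value for row, value in zip(x, y))
    det = s11 * s22 - s12 * s12
    level = (s22 * b1 - s12 * b2) / det
    delta = (s11 * b2 - s12 * b1) / det
    # At the solution, theta' X'X theta equals theta' X'y.
    explained = level * b1 + delta * b2
    s2 = (sum(value * value for value in y) - explained) / (n - 2)
    se_delta = math.sqrt(max(s2, 0.0) * s11 / det)
    return level, delta, s2, se_delta
```

The t-statistic is then `delta / se_delta` whenever `se_delta` is positive.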

When $\gamma$ is unknown (as will generally be true) a Bayesian analysis
using sample information about $\gamma$ is used in making inferences
about $\delta$. The posterior distribution, $h(\gamma|z)$, of $\gamma$ given a set of N
observations and assuming a uniform prior distribution is known to
within a constant of proportionality,


$$h(\gamma|z) \propto \frac{\gamma(2-\gamma)}{\left\{[1-(1-\gamma)^{2n_1}][1-(1-\gamma)^{2n_2}]\right\}^{1/2}} \, (s^2)^{-(N-2)/2} \qquad (5.8)$$

where $s^2$ is the residual variance and is given by

$$s^2 = \frac{1}{N-2}\left(y^T y - \hat{\theta}^T X^T X \hat{\theta}\right) \qquad (3.18)$$

for a given value of $\gamma$.


Computing Procedures
The present program performs the following operations:

1. The correlograms (autocorrelations for lag 1 through lag $3n_1/4$ or
$3n_2/4$) are calculated for $z_t$ and for $z_t - z_{t-1}$ separately for
pretreatment and posttreatment data.

2. The posterior distribution, $h(\gamma|z)$, of $\gamma$ given the data z for a
uniform prior distribution is calculated and plotted. The value of
$h(\gamma|z)$ is found for 200 values of $\gamma$ from .01 to 2.00 in steps of
.01. For each of these 200 values of $\gamma$:

a. $\hat{\theta}$ is calculated,
b. The variance error of $\hat{\delta}$ is calculated,
c. The t-statistic equal to the ratio of $\hat{\delta}$ to its standard error is
calculated and plotted on a graph which is superimposed on the
graph of $h(\gamma|z)$.

Formation of the matrix of weights X in (3.3) raises problems with
“underflow.” As $\gamma$ approaches 1.0, successive elements of X get very
small. Consequently, when | z | < 1.0 X 107°, the effects of the 
weights can be considered negligible and each subsequent value of 
$x_{t1}$ is set equal to zero for that $\gamma$. The same problem and treatment
apply to $x_{t2}$.

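The vector $\hat{\theta}$ is formed by equation (3.4) and the residual variance
$s^2$ calculated from formula (3.18). The standard error of $\hat{\delta}$ is given by

$$\hat{\sigma}(\hat{\delta}|\gamma) = \left\{ \frac{s^2\,\gamma(2-\gamma)\,[1-(1-\gamma)^{2N}]}{[1-(1-\gamma)^{2n_1}][1-(1-\gamma)^{2n_2}]} \right\}^{1/2}$$

The t value for testing the significance of the difference of $\hat{\delta}$ from
0 is given by $t = \hat{\delta}/\hat{\sigma}(\hat{\delta}|\gamma)$.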

Special problems of computer accuracy arise in the calculation of
the posterior distribution of $\gamma$. When $s^2$ is large, the posterior
distribution can be very small, since $s^{-(N-2)}$ is a multiplicative factor in
formula (5.8). To prevent any problems associated with this
calculation, all factors are transformed to $\log_{10}$ prior to the calculation of
the ordinates of the posterior distribution, which are then
transformed by subtracting the largest value thus obtained. Thus, all
values of the posterior distribution are divided by the maximum
value. With one slight exception, antilogs are taken to restore the
values to their original form. The exception is that when the log
of a given value differs from the log of the maximum value by more
than 35 (i.e., given value/maximum value < $10^{-35}$), the value of
the ordinate corresponding to the given value is set equal to zero.
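
The rescaling step described above is the familiar device of working with logarithms and subtracting the maximum before taking antilogs. A minimal Python sketch follows, assuming base 10 logarithms as the cutoff of 35 suggests. The function name is hypothetical.

```python
def rescale_from_logs(log10_ordinates, cutoff=35.0):
    """Divide every ordinate by the largest one, working on the log scale."""
    peak = max(log10_ordinates)
    scaled = []
    for value in log10_ordinates:
        gap = peak - value                 # log10(maximum / given value)
        # Ordinates more than `cutoff` orders of magnitude below the peak
        # are set to zero instead of being allowed to underflow.
        scaled.append(0.0 if gap > cutoff else 10.0 ** (-gap))
    return scaled
```

Ordinates whose logarithms are -100, -101, and -140 are returned as 1.0, 0.1, and 0.0, although none of the three original values could be formed without risk of underflow on a machine with a limited exponent range.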


Each of the values of $\hat{L}$, $\hat{\delta}$, $s^2$, $\hat{\sigma}(\hat{\delta}|\gamma)$, t, and $h(\gamma|z)$ is stored for each
value of $\gamma$. (The values of the posterior distribution of $\gamma$ are rescaled
by fitting trapezoids so that the curve has unit area.)
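
Rescaling to unit area by trapezoids amounts to dividing each ordinate by the trapezoidal-rule integral of the curve. A short Python sketch with hypothetical names:

```python
def normalize_to_unit_area(grid, ordinates):
    """Rescale ordinates so the trapezoidal area under the curve equals 1."""
    area = sum(
        (grid[i + 1] - grid[i]) * (ordinates[i] + ordinates[i + 1]) / 2.0
        for i in range(len(grid) - 1)
    )
    return [value / area for value in ordinates]
```

On the grid 0.0, 0.5, 1.0 the ordinates 1, 3, 1 enclose an area of 2, so they are rescaled to 0.5, 1.5, 0.5.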


Illustrative Analysis 


An illustrative analysis will be performed on data from an ex- 
periment by Deese and Carpenter which was adapted for presenta- 
tion in Brown (1961, pp. 118-119). Two groups of rats were given 
24 training trials in running a short alley for food. Group A had 
been fed wet mash for one hour prior to the experiment; group B 
had not eaten for 22 hours. After 24 trials the conditions were re- 
versed, group A being deprived of food for 22 hours and group B 
being fed for one hour prior to a final eight trials. Observations
were made of the length of time between start of a trial and a run- 
ning response for each rat. Observations were converted to loga- 
rithms of this latency period for each rat which were then averaged 
and divided into 1 for both groups. The reciprocals of the average 
log latencies for groups A and B over 32 trials appear in Figure 1. 
Data for Group B (high drive followed by low drive state) appear 
as the solid line; the broken line is for Group A. 

The significance of the effect of shifting from a low drive state to 
a high drive state (Group A) is apparent and would not be en- 
hanced by further statistical analysis. However, the significance of 
the slight downward shift between the 24th and 25th trials for 
Group B is worthy of further investigation. 

Assuming that the fundamental time-series process generating the
series is the same for both Groups A and B, the $n_1 = 24$ pretreatment
observations for both groups should provide a reasonably good,
though somewhat unstable, estimate of the correlograms for raw
data, $z_t$, and differences, $z_t - z_{t-1}$. The $n_2 = 8$ posttreatment
observations are insufficient in number to add any substantial
information concerning the fit of the model. Even the small number of
pretreatment observations probably represents fewer than a minimal
number from which one may draw inferences about the fit of
the model with any confidence. The correlograms for Groups A and
B for the data in Figure 1 are calculated. To conserve space these
correlograms are not reproduced here. These two sets of
autocorrelations evidenced neither cycles nor systematic dampening effects
characteristic of time series of types other than moving average
series. On the basis of inspection of Figure 1 and the correlograms
of the original data we can continue to entertain the model of
equations (3.1a) and (3.1b).

Figure 1. Starting speeds for two groups of rats before and after a
reversal in deprivation schedule. (Vertical axis: reciprocal of
average log latency in seconds; horizontal axis: trials.)


The adequacy of the model is further investigated by observing
the correlogram of the differences between adjacent observations in
a series, i.e., the correlogram of $z_{t+1} - z_t$, t = 1, ..., N - 1. The
correlograms were calculated for the pretreatment data in Figure 1.
The lag 1 through lag 10 autocorrelations of the differences
between successive observations appear in Figure 2.


The correlogram for $z_t - z_{t-1}$ where $z_t$ conforms to the model in
(3.1a) should show a lag 1 correlation of $(\gamma - 1)/[1 + (\gamma - 1)^2]$ and
lag k autocorrelations of zero (k > 1). For sample data, an
approximation to the standard error of an autocorrelation coefficient due to
Bartlett (1946) is available. The slanted, straight lines at the top and
bottom of Figure 2 mark off a distance of two standard errors of the
autocorrelation coefficient of lag k, $2\sigma_{r_k}$. Note that only one (lag
9, Group B) of the 18 autocorrelation coefficients in Figure 2 lies in
the region of rejection for the hypothesis that the population value
of an autocorrelation of lag greater than 1 is zero.

As will be seen later, the maximum likelihood estimates of $\gamma$ are 0
and .25 for Groups A and B, respectively. The lag 1 autocorrelation
for Group A is almost equal to the expected (on the basis of the
model) value of $(\gamma - 1)/[1 + (\gamma - 1)^2] = -.50$. If the model fits
exactly, one would expect a lag 1 autocorrelation for Group B of
$(.25 - 1)/[1 + (-.75)^2] = -.48$. The obtained value of -.28
does not differ significantly from this expected value. Acknowledging
the limited power of these statistical tests and the fact that to
accept uncritically a model on the basis of so few observations is
largely a matter of faith, we proceed with the analysis of the data
for Group B assuming that the model of (3.1a) and (3.1b) holds.

Figure 2. Autocorrelations (lag 1 through lag 10) for differences
between adjacent observations in the two time-series in Figure 1.
(Vertical axis: autocorrelation coefficient.)

In Figure 3 appear the following: (a) the posterior distribution
of $\gamma$, which indicates the likely values which $\gamma$ might assume for
these data (the maximum likelihood estimate of $\gamma$ is found by
noting that value of $\gamma$ under the peak of the curve), (b) the
t-statistic used in testing the hypothesis that $\delta$, the shift in level of the
series associated with the events between trials 24 and 25, is equal
to zero, and (c) the values (dotted lines) of t needed for significance
at the .15, .10, and .05 levels in testing the hypothesis $H_0$: $\delta = 0$
against the hypothesis $H_a$: $\delta < 0$. Note that if $\gamma = .20$, $H_0$ can be
rejected at the .05 level in favor of $H_a$. If $\gamma$ is .50 or above (which
appears relatively unlikely), $H_0$ cannot be rejected at the .15 level.
Our impression is that the data do indicate a statistically significant
shift downward after the 24th trial for Group B.

Figure 3. Graphs of the posterior distribution, $h(\gamma|z)$, of $\gamma$ and
$t = \hat{\delta}/\hat{\sigma}(\hat{\delta}|\gamma)$ as functions of $\gamma$. (The dotted lines mark the 15th,
10th and 5th percentiles in the t-distribution, df = 30.)


Availability 


A listing and source deck of the FORTRAN II program with
which these data were analyzed is available on request. The complete
solution including plotting for a problem with 60 pretreatment
points and 48 posttreatment points requires about 90 seconds, using
the IBM 7094.
Educational Measurement by Richard H. Lindeman. Glenview,
Illinois: Scott, Foresman, and Company, 1967. Pp. 173. (paperback).

If one accepts the proposition that the subject matter of measure- 
ment and evaluation can be presented in a condensed fashion, 
Lindeman's book is probably among the best of the lot. The
foreword to the book indicates that it was written for the publisher's
Keystones of Education Series which is intended to be a set of
“. . . relatively brief but authoritative books, selective in content so
as to develop in considerable depth key areas of knowledge . . . the 
series may be profitably used in both introductory and advanced 
courses, for the instructor is free to construct a course with the con- 
tent, emphasis, and sequence he desires, by selecting a combination 
of books to serve as text material.” The book certainly meets the 
criteria of being brief and authoritative. Whether it can be used 
profitably in both introductory and advanced courses seems open 
to speculation. 

The preface supplements the foreword by stating, “First, he (the 
author) has made it clear that the basic theory of educational
measurement is genuinely simple and requires no quantitative
sophistication. Second, he has demonstrated that the profitable application
of the principles of measurement to the recurrent problem of class- 
room entails no demand greater than mastering a statistical ‘lan- 
guage' which is less extensive in its vocabulary and less complicated 
in its grammar and syntax than Basic English. Third, he has shown 
how, with this minimal equipment, a teacher can evaluate himself 
and his pupils... ." 

Forewords and prefaces, of course, are supposed to say these 
kinds of things. Generally, however, there is a letdown when the 
text content is encountered. In this respect, the present book com- 
petes very well. In fact, the letdown is perhaps a bit greater than 
most. Assuming that the text is directed primarily to undergraduates
and graduates taking their first training in measurement and
evaluation, it is highly unlikely the volume can do the things
claimed for it.

Basically, the book is composed of six chapters, each providing a
condensed treatment of the usual common core of subject matter
found in the area, i.e., the nature of measurement, basic statistics,
criteria for test evaluation, construction and analysis of classroom
achievement tests, standardized tests, and procedures for marking
and reporting. Thus, the content presents nothing unusual or unique 
from the subject matter standpoint. Two condensed appendixes ac- 
company the six condensed chapters. One appendix includes tables, 
and the other a well presented set of computational procedures for 
the statistical concepts presented in the main body of material.

In summary, for those who like brief but authoritative attempts 
to treat a broad range of subject matter, Lindeman’s effort will be 
quite well received. Instructionally, the book might be considered
for classroom use by the instructor who views a text as purely sup- 
plementary to his teaching. However, the instructor who treats a 
text as the basic anchor for instruction might very well go down 
with the ship using this book. Although only alluded to in the pref- 
ace, its greatest utility might be as a brief and authoritative review 
for people who have received instruction in measurement and eval- 
uation at an earlier period in their professional training. 

James C. MOORE
The University of New Mexico 


Introduction to Statistical Analysis and Inference for Psychology 
and Education by Sidney J. Armore. New York: John Wiley 
and Sons, 1966. Pp. xx + 545. $8.95. 

Among the many introductory statistics texts that are currently 
available, it is often difficult —if not impossible—to discern features 
that make any given text a unique or a specially attractive contri- 
bution to the literature. So it is a strange “feeling” to be confronted 
by a text that has a deep and an immediate appeal—a quality that 
makes it stand out in the crowd—without being able to point to 
specific features that result in this kind of affect. 

Possibly, one of the reasons that enable publishers to support the
wide range of statistics texts being offered is that the place for the
introductory statistics course in the total academic program has not
been settled with even a remote degree of finality, in that in some
institutions it is an undergraduate course, in others a graduate
course, and so on. Perhaps the “uniqueness” in Armore’s text lies in
its potential flexibility of use, not only in terms of level, but also in
terms of the background required of its readers.

The text is particularly suited to those who want to study
statistical methods on their own. There is a very thoughtful “note to the
instructor,” provided by the author, whi i 
material both at the “essentials” lev 
stage. The “self-instructor” thus is better able to foll 
mum path, 

Clearly, the book has been prepared with the
beginner in statistics in mind. The “a tary nonmathematical
nature of the text
has been attained by presenting the substance in a high degree of
detail and with a careful attention to the "why" of each procedure 
in largely nonmathematical terms. Since, however, it is virtually 
impossible to do much in statistics with any success, without
reference to some mathematics, the author has provided, in an appendix,
the basic rules of algebra. Although this appendix is so
“cookbookish" as to make any mathematician shudder in horror, it must
be admitted that most first courses in statistics in both psychology
and education still receive a high proportion of students who need
this kind of review potential during the early stages.

For the more mathematically oriented student who has both the 
patience and manual dexterity, a series of proofs and derivations 
appears at the end of some chapters as “algebraic notes.” Although 
the main body of the text may be read without reference to these 
notes, it is difficult to imagine many courses that would not include 
the material as essential. And as a matter of fact, the extent of these 
notes is not so great as to justify the annoyance of having to fumble
for the end of a chapter to see whether a mathematical proof has 
been included. This over-concern for the nonmathematical reader, 
plus the occasional gay assertion that an “adequate explanation of a 
method or a procedure is omitted because it is not appropriate to in- 
clude such material in an introductory text” renders the text less 
valuable for the more serious students. However, sets of suggested 
supplementary readings are occasionally given—especially with the 
more “advanced” content—and these have obviously been chosen 
with care and insight. 

Each chapter includes a number of problems for practice and for 
review. On the whole, these are not demanding, computationally or 
conceptually. Answers are provided for even-numbered problems, 
providing adequate feedback to students who want to work on their 
own. 

Although comparisons are frequently invidious, it may be helpful 
to the reader if one places this text somewhere in the hierarchy of
well-established stand-bys in the field. In terms of composition, it
lies near the realm of the first sixteen or so chapters of Ferguson’s
Statistical Analysis in Psychology & Education, and ultimately 
makes about the same quantitative demands. In terms of presenta- 
tion, the book approximates the style of Guilford's texts, though the 
depth is not so great. Thus, if an instructor is looking for a text 
somewhere in this intermediate domain, Armore's is an alternative 
that is worth considering. 

The content is not new; nor is it organized in any startlingly new 
fashion. There are five major sections: Orientation and Basic Con- 
cepts, Statistical Description and Analysis, Foundation for Statis- 
tical Inference, Applications of Statistical Inference, and Associa- 
tion and Prediction. 
The first part contains the inevitable attempts to establish a basic
vocabulary, a minimum of notation, and a rationale for the study
of statistics. As a mathematician, the latter seems rather specious,
but no doubt it satisfies the more applied consumers. The real
strength of this section lies in the clarity of presentation of the
vocabularies, verbal and numerical. When read in conjunction with
the appendix, the latter has been particularly well done. One might
argue somewhat with Armore's discussion of scales of measurement.
Using Stevens' classification into nominal, ordinal, interval, and
ratio scales, Armore rather clumsily defines his levels. For example,
he defines an ordinal scale as a scale of relative importance. This is
a quite unnecessary restriction. Similarly, the loss of a chance to
define admissible transformations of data seems regrettable.

In part two, the reader may grind through the usual descriptive 
measures. Even though each is rather nicely presented, the expendi- 
ture of 140 pages on these issues is perhaps hard to justify. 

In contrast, part three is rather hurried. The part consists of three
chapters: one on probability, one on random sampling, and one on
“theoretical distributions” (which turns out to be a scarcely-theoretical
description of the normal and the binomial distribution).
With a few qualms about the lack of real rigidity in presentation,
the chapter on probability is simple, direct, and to the point. The
examples are excellent. Each is worked out step by step, and by
appeal to rules of operation. Unfortunately, it is never made quite
clear why one chooses a given rule rather than another. But for all
this, the chapter is probably (sic) one of the better examples of
writing on the topic. The chapter on sampling is a little too devoid
of mathematical justification to make it completely convincing, but
is a stimulating first approximation to some of the problems in the
area. The desire not to frighten the nonmathematical student is at
its worst in the chapter on theoretical distributions: an
over-reliance on words and diagrams results in a tiringly long
discussion of the normal curve and its properties, and of the binomial
distribution.

Part four is concerned with ар] 
hypothesis-testing and decision- 
demonstration that this topic may be handled well at an
elementary level.

In summary: little that is new but much that has been very ele- 
gantly said. Enough variance in difficulty-level to satisfy a wide 
clientele. This text would be a suitable alternative to some of the 
better-established ones. 

Peter A. TAYLOR 
Rutgers University 


Kindergarten Evaluation of Learning Potential by John A. R. Wil- 
son and Mildred C. Robeck. Santa Barbara, California: Web- 
ster Division, McGraw-Hill Book Co. 1965. Pp. xvii + 242. 

The KELP technique is an approach to the identification during 
kindergarten of the learning capacities of children. The first section 
of the book presents the authors' theoretical rationale, describes the 
purposes of the instrument, tells something of the development of 
the items, provides some evidence of validity, and discusses the 
teachers’ role as an observer. The authors consider the major pur- 
pose of the technique to be “. . . to teach the child the skills and de- 
velop the motivations on which he will be evaluated." (p. 13). The 
KELP consists of eleven items that the teacher uses for both in- 
struction and evaluation over the year of kindergarten. These items 
are: (1) skipping, (2) color identification, (3) bead design, (4) 
bolt board, (5) block design, (6) calendar, (7) number boards, (8) 
safety signs, (9) printing, (10) auditory perception, and (11) social 
interaction. Except for skipping and color identification, each item 
is scored at three levels: “(1) association, a process which includes
operant conditioning, (2) conceptualization, the grasping of whole 
ideas, and (3) creative self expression.” 

The approach was developed initially as a technique for use in 
teaching and informal day-to-day evaluation, and it is in this con- 
text that it shows the most promise. However, the authors also 
advocate its use as a formal tool for measuring intellectual develop- 
ment and for this purpose serious questions must be raised. Some of 
the points that the evaluation specialist might wish to consider in- 
clude the following: 


1. Since the same items are used for both instruction and evalua- 
tion, they constitute a biased sample for evaluation and no 
generalization seems possible to the item universes of which 
the items are a part. 

2. Since specific instruction is given on each item, it would appear 
that the pupil’s behavior on these items would reflect not only 
his level of development but also the skill of the teacher. The 
authors present no information that would provide a basis for 
estimating the magnitude of the teacher skill variable. 

3. The instrument relies heavily upon the teacher's ability to
function as an unbiased observer. It would seem almost inevitable
that halo effect would influence the teacher's judgment,
especially for those items where scoring is subjective
and calls for a fairly high level of inference. The authors report
no studies that would provide a basis for estimating the
amount of halo effect.

The usual sequence involving item development, trial, revision, 
retrial and validation was followed by the authors to arrive at the 
eleven items that constitute the current measure. The authors report
some evidence of face validity, i.e., teachers who had used the
technique perceived it to be useful and related to the kindergarten 
curriculum. 

Evidence that might be considered related to both concurrent and 
construct validity is presented in one study in which KELP scores 
are correlated with scores on the Revised Stanford-Binet Intelligence
Scale. The present form of KELP was used in a study of one
teacher and her 45 pupils. Correlations between Stanford-Binet MA 
and IQ and KELP scores ranged from .60 to .73. The teacher in- 
volved in this study was using KELP for the first time. 

In a predictive validity study, one teacher with two sessions of 40
and 39 children used the KELP technique. At the end of grade 1 
these children were given the Metropolitan Achievement Test 
(MAT). The correlation between MAT Reading and KELP was .60, 
while the correlation between MAT arithmetic and KELP was .76. 
In a more extensive study, KELP scores for 280 children in 5 schools 
were correlated with achievement ratings rendered by their first 
grade teachers. The correlations between ratings and KELP total 
scores varied considerably from school to school, ranging from .28
to .81. Other validity studies in which the investigators used earlier
versions of the KELP produced results that were consistent with
those presented in this review.

adequate than for the less concrete behaviors such as social interaction.

In reviewing the instructions for scoring the items, it is apparent 
that scoring some items calls for a much higher level of inference 
than others. The teacher faces a much more concrete and well de- 
fined task in scoring level one of the block design items in which the 
child must reproduce a block design he sees on a card than in level 
three of the calendar task in which the child must explain the social 
significance of a holiday. Experience with observer behavior would 
certainly suggest that most teachers would score the former be- 
havior much more reliably than the latter. 

Since the KELP technique is purported to be an evaluation tool, 
it is difficult to see why the authors suggest departures from a stan- 
dard approach for some of their items that change the nature of the 
task without providing any advantages that compensate for the loss 
of the standard situation. For example, for item 10 (auditory per- 
ception) the device consists of a box with three partitions, each 
containing at least six objects having names that start with the 
same sound. The teacher, however, is given the option of including 
more than six objects in each compartment which would appear to 
change the difficulty of the task. At level one the child must name 
five of six objects in each compartment and articulate the names 
correctly. However, since some names are more difficult to articulate
than others, and teachers are free to select different objects for this
item, it would appear that errors are being needlessly introduced.

Most of the scoring instructions would be improved by giving 
more specific information and a wider range of examples of satis- 
factory and unsatisfactory behavior. For some items such as block 
design level three, calendar level three, and numbers levels two and 
three much more information seems essential if different teachers 
are to obtain comparable information from the measure. 

The third section of the book includes information on the use of 
the KELP technique to identify various non-typical pupil groups, 
and as an aid in progress reports and parent conferences. This sec- 
tion also deals with interpretation and use of norms. 

Suggestions are given regarding pupil behavior on the KELP 
items that might help the teacher identify gifted, retarded, emo- 
tionally and socially maladjusted children as well as children with 
specific disabilities or learning problems. Although the information 
given in this chapter is very brief, it would probably improve the 
average teacher's skill in identifying non-typical children. The authors
suggest that children suspected of mental retardation, maladjustment,
visual or hearing defects be referred for further testing or
diagnosis. Thus, there seems little danger that the teacher who uses
KELP will come to believe that this technique qualifies her to make 
a final diagnosis. 
The KELP appears to have real potential as an aid to the kindergarten
teacher in making progress reports and conducting teacher
conferences. Since it requires keeping records and also draws the
teacher's attention to a variety of pupil behaviors, it should result in 
better focused conferences and reports. 

The norm tables for the KELP are not provided in the book, but 
may be obtained from the publishers. Little information is given re- 
garding these norms. For example, the authors indicate that the 
1965 norms are based on about 2000 cases. Virtually no information 
on the nature of the sample or the sampling technique is given ex- 
cept that "The sample was limited to schools and teachers using the 
materials on a self-supporting basis.” (p. 218). This chapter is
probably the weakest in the book, containing only 6 pages of text,
much of which is concerned with a description of the normal curve
and the stanine. The chapter contains almost nothing specific about 
interpretation of the KELP norms. A sample page from the norm 
tables would have been helpful. One gets the impression that this 
chapter was added rather reluctantly in an attempt to give some as- 
pects of a standard measure to an approach that was developed as a 
day-to-day teaching aid. The authors suggest that “Interpretation 
by the teacher on a day-to-day basis may continue to be more important
than will the year end tabulation” (p. 215). At its present
stage of development this appears to be the only legitimate use for
the KELP technique. The kind of day-to-day appraisal it permits,
however, is important and in itself sufficient justification for its own
use.


Walter R. BORG
Far West Laboratory for
Educational Research and Development
(Berkeley)
Testing and the interpretation of test data are not mystical enterprises.
But they are not ventures to be embarked upon with
carefree abandon. Too little learning in the testing field is a decid- 
edly dangerous thing considering the kinds of life-decisions that are 
based on test information. Thus it may be better not to promulgate 
superficial half-knowledge, thereby inculcating an unreasonable
overconfidence in psychometric ability, but to concentrate on show- 
ing teachers and counselors how to make use of real specialists in 
the testing area. If this book had set out as its purpose to expose 
test-data consumers to minimal vocabulary, it might have ap- 
peared more successful. But then one has to acknowledge that there 
are excellent free glossaries published by several test companies. 

Still, there is no denying that someone must be using the book. It 
has run into a second edition, after all. So one question is, what are 
its present users getting from it? 

An answer has already been given, in a general way: some kind of
vague sense of what testing is about. Note that this is an explicit
distinction from what testing is.

Let it be clear that the reviewers have no aversion to teachers 
and counselors “knowing” what educational measurement is about: 
the concern is that too little knowledge be stretched too far. Test 
theory really is considerably more technical than many writers 
would have one believe. Notions of “how to choose a test” are all 
very well—but when life decisions are based primarily on “inter- 
esting information,” alarm is not unjustified. And it is the opinion 
of the reviewers that much of the substance of Downie’s book is 
treated in a fashion that can only, with some effort, be construed as 
substantively adequate. That anyone should feel any more than 
uneasily competent to handle statistical questions after studying 
this text will be fortuitous; that anyone should know more than the 
dictionary existence of concepts such as validity will be only 
slightly less fortuitous; and the better parts of the volume (such as 
those sections describing item-types) are in direct confrontation 
with other established books in the area. 

The first chapter defines and explains evaluation. It includes 
three pages of philosophy of evaluation which could be entitled “a 
teacher’s guide to democracy in the classroom.” While few would 
deny that the thoughts expressed are admirable, one might question 
the appropriateness of such an emotional presentation in a book on 
measurement. Common terms are then defined. This section is of 
dubious value. The terms are explained more fully when the topic
is covered at length. Furthermore, a glossary is just as, if not more, 
useable. 

In the second chapter, the usual elementary statistics are ex- 
plained in brief, simple terms that should not be too difficult for the 
average high school student. Surprisingly enough the t test is included
here; this is unusual for a book on measurement, especially a
primary level one. Reliability and validity are dealt with in a later 
chapter. 

roni grades, and norms are treated adequately. One wonders 
about Downie’s rather naive plea for not always grading on a curve. 
Surely, the high school student is sophisticated enough to intuit this. 
The comments on final grades which are naive might be better 
suited to a section on classroom tests. Downie does make a strong 
point in this section: "Students should know in advance, long in ad- 
vance, at the beginning of the course, just what factors are going to 
make up the final grade.” 

The treatment of reliability is adequate, though elementary. 
However, Downie maintains that testing conditions can affect the 
reliability of a test. This has been empirically demonstrated to be 
untrue. Validity is covered briefly, but reasonably. 

Chapter 5 gives an overview of the use of standardized tests and 
methods of scoring them. The reviewer's only complaint is that 
Downie again stresses the when and where of testing as important 
factors in the scores obtained by students. It would have been pref- 
erable if the section on external testing programs had been limited 
to a description of the tests. Instead, Downie philosophizes on the 
merits and dangers—mainly the dangers—of such programs and on 
ways to deal with them. 

Part II of the book is devoted to issues concerned with achieve- 
ment tests. The six chapters in the section contain some of the 
better content. 

Chapter 6 discusses—as always, very briefly—the “nature, uses, 
and general directions for construction” of achievement tests. The 
chapter starts with an historical review of the progress of the 
achievement-testing movement, and in so doing proceeds to a rather 
narrow and biased definition of "evaluation." The uses of achievement
test data are seen as six fold, including an animistic establishment
of grades; diagnosis (ignoring questions about the misuses of
Mee DOR or reliability of part-scores); through the evaluation
so, but not always, especially in the evaluation of course content,
for example. The chapter ends with paragraph-long discussions of 
test length and test directions, and a rather dated description of 
corrections for guessing. One suspects that it does not help a non- 
specialist too much to know that the use of correction-for-guessing 
formulae “is a controversial issue among the experts in test con- 
struction.” And the proposed solutions are not altogether free of 
controversy themselves. 
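
The correction at issue is presumably the familiar formula score, in which a fraction of the wrong answers is subtracted from the number right. A minimal Python sketch follows, assuming the conventional form S = R - W/(k - 1) for items with k options and with omitted items left unscored. Whether this is the exact version Downie describes is not established here, and the function name is illustrative only.

```python
def corrected_score(right, wrong, options):
    """Formula score S = R - W / (k - 1).

    right   -- number of items answered correctly (R)
    wrong   -- number of items answered incorrectly (W); omits are not counted
    options -- number of answer choices per item (k), assumed equal for all items
    """
    if options < 2:
        raise ValueError("an item needs at least two options")
    return right - wrong / (options - 1)


# Hypothetical examinee: 40 right and 10 wrong on five-option items.
print(corrected_score(40, 10, 5))  # 37.5
# On true-false items (k = 2) every wrong answer costs a full point.
print(corrected_score(30, 10, 2))  # 20.0
```

The sketch shows why the matter is contested: the penalty rests on the assumption that every wrong answer was a blind guess among equally attractive options.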

Chapters 7 and 8 are excellent. Both chapters deal with the types 
of item one might use in constructing objective tests. The suggestions 
and examples given are short and to the point, are remarkably com- 
prehensive, and are well discussed. Chapter 9 gives a well-reasoned, 
low-key review of some of the disadvantages and advantages of the 
essay test. A failing in the writing style is made rather clear, how- 
ever. There are several long lists of pros and cons, and do’s and 
don’ts that tempt parrot-like learning in this chapter. 

Chapter 10 is a blatantly cook-book attack on the procedures of 
item-analysis. This is not a totally unkind criticism—cook-books 
are known to evince delectable fare. But without some theoretical 
notion as to why his cake failed to rise, the cook can be lost. And for 
the teacher without some knowledge of the theoretical (distribu- 
tional) limitations on the common item analysis statistics, there 
can result some fairly soggy interpretations. 

Part II ends with a conventionally organized chapter on stan- 
dardized achievement tests. This kind of chapter is difficult to write 
without its becoming a dull catalog. Downie has succeeded ad- 
mirably in avoiding this trap, but has lapsed into occasional specu- 
lative musing that might have been better left unsaid. 

Part III makes an attempt to summarize the main instruments 
for measuring intelligence, special abilities, adjustment (“personal- 
ity”), interests and attitudes; evaluating socioeconomic status and 
teaching effectiveness. It is in these final chapters (12-18) that far 
too much superficial, happenstance flitting about occurs. If teach- 
ers and counselors ever needed to use most of these devices, they 
would need far more training than this section hopes to provide. 
And there is always the question as to just how many instruments 
like the Binet, WISC, Rorschach, and EPPS (to name a minuscule
few) are appropriate for use even by counselors.

The bulk of the substance of this section is typical. The usual descriptions
and commentaries on “common” tests appear (one wonders—based
on the number of text-books that use it as an example
—really how “common” something like the Horn Art Aptitude In- 
ventory is). There is nothing remarkable here. There are some 
naive, but vaguely interesting, side trips into such areas as the his- 
tory of intelligence testing, the uses and abuses of personality tests, 
and the like. The chapters on evaluating social status and teaching
are rather strange ones. Maybe they will be useful to someone.

The end of each chapter is adorned with one or two exercises 
based on the content of that chapter, and with one of the best fea- 
tures of the book—a list of suggested readings. There is an ap- 
pended glossary of measurement terms, a list of test publishers, 
and a full index. 

This book is not an impressive one. Yet it will serve many people 
who want the barest outline of what measurement is about, and 
who have little time to consider any topic in other than a super- 
ficial manner. 

Peter A. TAYLOR
Susan F. FORD
Rutgers University
to familiar classifications. This serves as an efficient and meaningful
way in which the reader may use the text as the index for which, it
is assumed, it was originally intended.

The author of this treatise on the occupational validity of psy- 
chological tests needs no introduction to the reader. That he is emi- 
nently qualified both in the fields of industrial psychology and 
psychological measurement is an unquestionable fact. As one would 
expect, therefore, the text is lucid and authoritative. The level of discussion
is elementary, and Dr. Ghiselli is careful to indoctrinate the
uninitiated reader, early in the text, on the major basic concepts
dealing with psychological measurement and prediction, e.g., reliability,
linearity of regression, correlation coefficients, and z-transformations.
He makes a very important point (p. 24) which
many texts injudiciously ignore: test validity may to a considerable extent
depend not only upon the reliability of the predictor measures, but also
upon that of the criterion. However, in spite of all this, there is much of which to
be critical in this volume: 

1. Many misleading statements are presented to the reader. For example,
the term “excellent” predictor is associated with the correlation
value r = .46, when it is obvious that the predictor accounts
for only approximately 20 per cent of the variance in the
criterion. Evidently, Dr. Ghiselli did not heed his own early cautions
to the reader, e.g., to consider test results only in relative
terms!

2. The procedure for gathering and preparing validity data pre- 
sented in the tables is not carefully or explicitly explained. Pre- 
sumably a variety of studies was investigated and combined; the 
r's were transformed to z's and averaged. It would have been
appropriate to give the reader the number
of studies as well as the number of cases involved in the averaging.
In any event, the author does take some license in the indiscriminate
averaging of r's obtained from many independent
samples.

3. The current text is presumed to have had as one of its purposes
the intention of bringing the earlier monograph up-to-date. If
this were true, as this writer feels it should be, the necessity of
discussing tests such as the MacQuarrie, which is now somewhat
out-of-date and not widely used, is questionable.

4. On several occasions he makes seemingly unwarranted comparisons
between validities for proficiency criteria, as is done on page
37: “. . . seem to be a bit more valid for proficiency criteria (.23)
than for training criteria (.17).” Both these r’s are too low for
making meaningful comparisons if we consider the fact that
r² (.23) = .05 and r² (.17) = .03. Really, does it make much difference
if we say that one predictor accounts for five per cent of
the variance in one criterion and only three per cent in the other
criterion and therefore that it is more valid for predicting the
first criterion?
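
The arithmetic behind the points above is easily checked. The Python sketch below is illustrative only: it assumes the conventional Fisher r-to-z transformation with each study weighted by n - 3, which may or may not be the weighting Dr. Ghiselli used, and the correlations and sample sizes in the last line are hypothetical, since the text reports neither the number of studies nor the number of cases.

```python
import math


def variance_explained(r):
    """Proportion of criterion variance accounted for by a validity coefficient r."""
    return r * r


def average_r_via_z(rs, ns):
    """Average correlations through Fisher's z, weighting each study by n - 3.

    rs -- validity coefficients from independent samples
    ns -- the corresponding sample sizes (each must exceed 3)
    """
    if len(rs) != len(ns) or not rs:
        raise ValueError("rs and ns must be non-empty and of equal length")
    if any(n <= 3 for n in ns):
        raise ValueError("each sample size must exceed 3")
    weights = [n - 3 for n in ns]
    zs = [math.atanh(r) for r in rs]  # r-to-z transformation
    z_bar = sum(w * z for w, z in zip(weights, zs)) / sum(weights)
    return math.tanh(z_bar)  # back-transform the mean z to r


print(round(variance_explained(0.46), 2))  # 0.21
print(round(variance_explained(0.23), 2))  # 0.05
print(round(variance_explained(0.17), 2))  # 0.03

# Hypothetical studies: a small sample with r = .20 and a large one with r = .50.
print(round(average_r_via_z([0.20, 0.50], [30, 300]), 2))
```

The weighted average depends on the sample sizes, which is why a report of averaged validities is hard to evaluate without the number of studies and cases behind each entry.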


All-in-all, however, this is an important and necessary volume to 
be in the possession of every user of psychological tests, if for no 
other reason than the sobering effect it must have on those who are 
all too willing at times to put too much weight upon the predictabil- 
ity of any single measure. But, it is indeed unfortunate that there is 
no significant discussion of the value of combining the individual 
tests into predictor systems and integrating their data with other 
non-test measures in enhancing the predictability of occupational 
success.


Regarding the organization of the material presented to the
reader, there are four major categories into which the text is divided:


I Background on psychological testing and presentation of
the methodological concepts are covered in Chapters 1
and 2.

II Chapters 3 and 4 involve essentially discussions and tables
on the validity of tests for occupations classified according
to the GOC and DOT systems against both training and
performance criteria.

III Chapter 5 attempts to define a three-dimensional ability
space for the classification of occupations.

IV Chapter 6 deals with the power of the tests for predicting 
occupational success. 


The principal value of this book in this reviewer's opinion is as a
reference text. The test data which Dr. Ghiselli has so laboriously
collected and analyzed extend from a vintage of 1919 to 1964. Aside
from the sheer presentation of these massive validity data on psychological
tests applied to business and industry, however, the text
has little to offer the sophisticated reader. The treatment of basic
intent of the reviewer to point out to the reader certain limitations
and restrictions about which some caution is deemed necessary in in- 
terpreting and applying the information and data to which he is 
exposed. 
Peter F. MERENDA
University of Rhode Island 


Explorations in Human Potentialities by Herbert A. Otto (Ed.). 
Springfield, Ill: Charles C. Thomas, 1966. Pp. xix + 558. 
$17.00. 

This volume consists of a series of 35 original essays by 39 authors.
Its keynote, presented in the first chapter by Gardner Murphy
and Lois Barclay Murphy, is that society must find methods
for utilizing the full potential creativeness of the individual. Herbert
Otto, the editor, feels that this can best be accomplished by
establishing an Institute of Human Potentialities for surveying, cataloging,
and conducting research on the topic. One is reminded
here of the not uncommon conclusion to scholarly papers that “. . .
more research needs to be done.”

The editor has, within the pages of this volume, collected the 
opinions of personologists (Gardner Murphy), development specialists
(Charlotte Buhler), psychiatrists (J. L. Moreno), parapsychologists
(J. B. Rhine), anthropologists (Margaret Mead),
sociologists (Bartlett Stoodley), theologians (Donald Wells), edu- 
cators (Paul Heist), therapists (C. H. Hardin Branch), psycho- 
analysts (O. Spurgeon English), existentialists (Adrian Van 
Kaam), and many others. Consequently, it might appear that a 
wealth of data and sound theory would emerge from this joint en- 
terprise. But although the book is delightful to read—many of the 
essays being written in a popular vein—the reviewer felt that the 
end product of these specialized labors was not unlike the descrip- 
tion of the elephant given by the blind men of Hindustan. In truth, 
the majority of the writers realize the inchoate state of knowledge 
on the subject of human potentialities, and they would probably 
readily admit that much of what they have said here is merely free 
association. Therefore, the (obviously intended) function of this 
book is to serve as a goad to research and thinking about the topic 
in question, rather than being a summary of research and theory. 
As such, its relatively nontechnical style and content should appeal 
to layman and professional alike. 

Lewis R. AIKEN, JR.
Guilford College 


Research in Clinical Assessment by Edwin I. Megargee (Editor). 
New York: Harper and Row, 1966. Pp. xv + 702. 
Planned as a book to give the much overworked professional
clinician an opportunity to keep abreast of ongoing research in the 
assessment of clinical behavior, this volume consists of fifty-eight 
relatively recent investigations from twenty American, British, and 
Canadian journals in which several well-known psychologists ex- 
pound at length on the issues and basic problems underlying the 
methodology and techniques of assessment as well as on the skills 
of clinicians to evaluate clinical data. Consisting of carefully- 
chosen articles which have been thoughtfully organized into thir- 
teen chapters, with some chapters consisting of as many as nine 
contributions, the volume consists of three major sections: (1) Part
I—“Problems in Validating Clinical Methods,” (2) Part II—“Studies
of Specific Techniques of Assessment,” and (3) Part III—“The
Integration of Clinical Data in Assessment.” In Part I discussion 
centers on several general problems of research and assessment in- 
cluding the validation of constructs, base rates and their uses, the 
criterion problem, and response sets. Concern for specific techniques 
of assessment including a detailed consideration of problems unique 
to projective personality tests and structured scales underlies Part 
II; in addition, representative validation studies of the more signifi- 
cant and widely-known personality tests are included. In this sec- 
ond section the editor endeavored to have one study included which 
illustrated how a test was employed in a global fashion, a second in- 
vestigation in which an effort was made to validate some individual 
scale or score, a report upon the validity of an instrument in a clinical
setting, and an article showing how the validity of an instrument
was established in a laboratory situation. A third and final
section was particularly rich in research methodology that is concerned
with the problems underlying clinical versus statistical
prediction. In particular, attention was directed toward (1) the inte- 
gration of clinical data in assessment procedures, (2) the determi- 
nation of how well clinicians when presented with numerous sources 
of data can predict criterion measures or outcomes, and (3) an
analysis of whether the clinician or statistician including the electronic
computer is the better qualified to integrate the data of clini-

cognitive outcomes will actually find much in common with the
endeavors extended by the many clinical psychologists whose works 
have been reproduced. In the reviewer's estimation the quality of 
the articles chosen is unusually high. This volume should serve as 
an important reference source for a number of years both for the 
graduate student in psychology and for the research psychologist 
whether he be in an area of psychology that emphasizes achieve- 
ment and aptitude testing in personnel selection or whether he be a 
research worker intrigued with problems of assessment in per- 
sonality or in clinical behavior. 

William B. MICHAEL

University of Southern California 


Human Factors Evaluation in System Development by David 
Meister and Gerald F. Rabideau. New York: John Wiley & 
Sons, 1965. Pp. 307. $9.95. 

This book is frankly pragmatic in its approach to human engineering
procedures in the development of space and weapons systems,
at times pragmatic to a fault. In general, the authors do little
theorizing and devote their space to describing what, and how much, 
is and can be done by the human factors specialist in the develop- 
ment of a system. On the good side the book does take the “systems 
approach” to human engineering. It avoids both the compendium of 
detailed specifications of human capabilities and limitations and 
the collection of unevenly balanced articles that are typical of 
many books in this area. The bad side of the coin is that in present- 
ing a fairly comprehensive coverage of human engineering at all 
stages of the developmental cycle the content tends to be somewhat 
elementary. The emphasis tends to be upon hardware systems 
(where, of course, the authors are most qualified to speak) and not 
upon communication or information processing systems where soft- 
ware assumes a large role and may present subtler problems than do 
missiles and space capsules. 

The book describes with some fidelity the tools and techniques 
available to the human engineer in gathering information on re- 
quirements in evaluating designs and components and in conduct- 
ing tests of systems and subsystems. These techniques include in- 
terviewing, questionnaires, rating scales, and task, function, and 
link analyses and may include extensive simulation models and 
mockups and experimental trials. The authors indicate correctly 
that in the heavily engineering oriented milieu of system design and 
production there is seldom inclination to spend much time, money, 
or capability on extensive human factors experimentation. However, 
in rejecting “laboratory” and elaborately designed and controlled 
experimentation, the authors seem also to reject much of the extensive
collection of empirical data that is possible with today’s elaborate
instrumentation. It would have been useful if the authors had
indicated in addition to what is done, what should or could be done
to realize a better human factors evaluation job at each stage.

The book sometimes seems too acceptant of a second-class role 
for human engineers. In Chapter 4 (Product Evaluation), it seems 
quite clear that neither the authors nor their hardware cohorts rec- 
ognize the human engineer as a full-fledged member of the team. He 
is expected to take what he can get and go off by himself to evaluate
it rather than to participate in a team evaluation. However, even 
here the authors do good service by pointing out some of the things 
that can be done in evaluations from drawings, blueprints, and other 
specifications. Undoubtedly even more could be done with the paper 
products of the development process. 

Several chapters are devoted to field and system tests (although 
testing in an operational environment is largely ignored). The tests, 
however, seem more concerned with hardware performance than 
with human performance (which might be expected since these are
by-and-large joint tests). Adequate testing of systems of any kind
that involve human beings is an area where too little is done and
where too much emphasis is placed on expressions of customer satis- 
faction and phenomenological observations and too little upon ob- 
jective evaluations of operational ease and utility. More support 
from the authors for such tests would have been gratifying. 

Little is also said concerning the reliability of human beings as 
system components or of the tolerance of systems to human error; 
but the general problem of human error, skills, and system maintenance,
and equipment reliability is fairly well covered.

Overall, the book does cover its subject, albeit at a sometimes
[...] level. Although measurement problems and data collection
problems are touched upon, the rejection of statistical treat[...]
categories or by what performance reliabilit[...]
[...] that the system is reasonably [...]
is operated by human beings, and that the human [...]
specified requirements. To do any of [...]
much objective evidence [...] and has very little time or other
support to help him in doing his job. Despite its [...],
this book gives some grasp of the size of the problem and of the approaches
that have been reasonably successful. This is, however, a
rapidly developing field, and one would hope for more books of this 
nature soon. 

Norman E. WILLMARTH 

System Development Corporation 


Conditions of Learning by Robert Gagné. New York: Holt, Rine- 
hart and Winston, 1965. Pp. viii + 308. 

In this significant volume, Gagné has brought into clear focus two 
major lines of research with which he has been concerned during the 
past few years: the formulation of categories of learning and a hierarchical
description of learning structures.

In the interest of parsimony, many psychologists have tried to 
subsume one of the two forms of learning, classical and instrumental,
under the other. Gagné follows the opposite approach, suggested
first by Tolman and more recently by Melton, proposing
eight different categories of learning. The first, signal learning, in- 
volves respondent conditioning, while operant conditioning is de- 
scribed by the other seven classes, arranged in order of complexity: 
stimulus-response learning, chaining, verbal association, multiple 
discrimination, concept learning, principle learning, and problem 
solving. Gagné makes the assumption that to attain each of these 
levels of learning requires the mastery of other appropriate classes 
lower in the hierarchy. For example, learning to state a principle is 
only a verbalism if the individual has not previously acquired the 
prerequisite concepts. 

This organization of the learning process can be criticized from 
opposing points of view. On the one hand, it can be argued that the 
types of classifications proposed by the author are arbitrary and
constitute excess baggage that makes it unnecessarily complicated
to use a knowledge of the learning process in planning instruction.
Another aspect of this criticism deals with the redundancy which
Gagné’s organization cannot avoid. For example, he is forced to dis- 
cuss topics such as interference and transfer not only under each of 
several of his categories, but in a separate chapter as well. Gagné, 
however, convincingly demonstrates that one or two forms of learn- 
ing do not provide a useful framework for the diversity of learning 
phenomena. For example, he clearly points up the inadequacy of 
attempting to deal with verbal learning simply as a more complex 
type of “shaping” procedure. 

The opposing type of criticism would argue that the forms of 
learning cannot be understood in terms of a reductionist principle 
implied by Gagné. The higher cognitive processes, it is maintained, 
involve structures which are qualitatively different from the sim- 
pler processes. The position taken by Gagné should not be evaluated 
in terms of its truth or falsity; it would be fruitless for one to assess
Gagné's position in terms of whether he has described learning
"the way it really is." Note that his assumptions of a hierarchy
are subject to empirical test. The evaluation of his system will come
in terms of how much his classification structure furthers research 
and practice in education. In this respect, Gagné's formulation 
shows considerable promise; his categories should be far more fruit- 
ful for experimental studies in school learning than Bloom’s descrip- 
tive approach in the Taxonomy of the Cognitive Domain. Further- 
more, Gagné accepts a behavioral orientation which requires treat- 
ing complex processes in terms of measurable performance under 
definable conditions; since he continuously emphasizes both the de- 
pendent and the independent variables, readers with a different 
theoretical orientation may thus still find Gagné’s classification 
system of value. 

During recent years, as educational psychology has become con- 
cerned with the problems of cumulative learning and curriculum 
sequences, it has dealt with larger and larger units. Gagné, in his
presentation of the topic of learning structures in Chapter 7, de- 
scribes the hierarchical relationships among the different learning 
outcomes in such a sequence. In his discussion, the author offers 
good illustrations of learning structures in the fields of mathematics,
science, foreign language, and English. Unfortunately, Gagné does
not clarify the relationship between a learning structure (with its
empirically verifiable characteristics) and a subject matter struc- 
ture. He has made, however, a significant contribution by spelling 

The Rehabilitation of School Dropouts by Norman M. Chansky.
Springfield, Illinois: Charles C. Thomas, Publisher, 1966. Pp.
267. $9.75.

This book describes the results of evaluation of a project for high 
school dropouts, Operation Second Chance, conducted in North 
Carolina and funded by the Office of Manpower, Automation and 
Training, and the Area Redevelopment Administration. The proj- 
ect’s aim was to “raise the general educational and technical thresh- 
olds of hard core unemployables, but most especially the school 
dropout ... to instill a life purpose to the diffident dropout and to 
make available trained manpower for the local community and 
State labor markets.” To this reviewer these objectives were only 
partially achieved, at best. But the story bears telling in detail, and 
in keeping with the orientation of this journal it seems appropriate 
as well to give some attention to certain measurement procedures. 

The author of this volume apparently had no direct involvement 
in the actual planning of the project. Thus, he can function more 
effectively as the dispassionate observer. As he viewed it, his as- 
signment was to “define the several phases of the program and to 
determine their effects.” 

The book comprises two parts. The first contains five chapters; 
the second 30 case studies, and two appendixes containing certain 
test score data. 

The first chapter presents an excellent overview of the problem
of the school dropout. Unlike many accounts, it presents the problem
in its multifaceted complexity. Those looking for simple answers
will not find them in this summary, and appropriately so.
Had individuals responsible for the project been aware of this
literature, the project could certainly have been improved. But that is
getting ahead of the story. The remaining four chapters of Part I 
contain the project’s description and findings, some of which are 
given below. 

The subjects were school dropouts of both sexes from three
geographic regions of the state of North Carolina. The age range of
subjects does not appear to be reported. The average IQ of the sam- 
ples appears to be in the 80's. Using a psychometric criterion a large 
majority of the participants could be classified as mentally re- 
tarded. Nowhere are exact numbers involved in the project indi- 
cated, but certain data reported indicate that there were at least 
100 subjects in training in one region of the state, and at least 179 
in another. Thus it is difficult to know how many were involved and, 
therefore, what proportion of the participants are represented in the 
evaluation reported here. Also, data on sex and racial characteris- 
tics are unmentioned. These data are important if the results are to 
be generalized, or if the experiences reported are to be useful to 
others. The subjects received training of 6 to 16 weeks duration in 
preparing for subsequent employment in pre-apprenticeship carpentry,
automobile servicing, and as nurses' aides. They were enrolled
for six hours of vocational training and two hours of academic
training daily.

Prior to and following training the participants were administered
the Wechsler Adult Intelligence Scale, the Benton Visual Retention
Test, the Wide Range Achievement Test, and the Structured Objective
Rorschach Test. The test data, case study analyses, and
opinions solicited from a variety of persons form the corpus of the 
evaluation. 

The results of testing revealed some significant changes in I.Q.,
achievement, and personality in individual cases, and the author
attempts, in both empirical and idiographic fashion, to tie these
changes to post-training status (i.e., unemployment, or employment
in training-related or training-unrelated occupations) and, in some
instances, to regional differences.

To this reviewer no significant generalizations emerged from
these analyses, although a variety of interesting and provocative
hypotheses are suggested concerning regional differences, ability,
NW and other variables as these are related to employment
status.

The personality data seem especially suspect. No reliability data 
on scoring of the Structured Rorschach are provided, and while the
for their jobs. Coordinators confirmed the teachers’ self-evaluations,
and for their part were especially sensitive to bureaucratic red tape
which hampered the project’s efficiency of operation. Most employers,
the ultimate source of trainee employment, expressed the view
that the training was not related to industrial needs. Members of
the advisory boards were generally too uninformed to provide eval- 
uations of the programs in their district, although they were consti- 
tuted to provide guidance and involvement at the local level. Most 
educators in communities near the centers were unaware of the exist- 
ence of the Operation Second Chance program. Community residents 
were supportive of the program although critical in some respects. 
In spite of these observations some participants did in fact hold em- 
ployment after completion of the program, and, as indicated pre- 
viously, the author attempts to relate post-training status to par- 
ticipant characteristics, the training program, and regional 
variables. Many interesting hypotheses are suggested here. 

In view of the findings and the manner in which the training was 
conducted it seems legitimate to ask why this book was written. The 
answer to such a question perhaps may lie in the fact that the au- 
thor has wished to present students of the problem with grist for 
hypothesis testing; or perhaps he has wished to present an interesting
case history of the development and implementation of a
program for school dropouts at the state level. 

As indicated at the outset, this book is a report of evaluation of a 
project. It has pretended to be nothing more. Carefully conducted 
studies which test hypotheses suggested by this report would be of 
great value. 

REGINALD L. JONES 
The Ohio State University 


Preparing for College by Bette J. Soldwedel. New York: The
Macmillan Co., 1966. Pp. 117.

This book is a basic primer for parents who are contemplating 
college for their children. It also contains a few asides for their chil- 
dren. The book looks at the college scene from application blank 
(Why go to college) to diploma (How to graduate from college). 
This step by step analysis is facilitated by defining the basic terms 
that are commonly used to describe college life (e.g. credit, dean, 
and liberal arts). Many chapters contain a supplementary reading 
list for those parents who wish to obtain more information about a 
topic. 

A basic theme that pervades the book centers on the question of
why a person should attend college. It deplores the fact that many
parents see college as a vocational training center. The book does
stress the idea that acceptance or rejection of an applicant is not 
directly correlated with a value judgment about the child or his 
parents. It does point out the fact that if any parent really wants to
send his son or daughter to college he may do so. The combination of
varied college admission policies and vacancies makes it a distinct
possibility that any high school graduate may be accepted for
admission. (The reviewer was once asked by a mother to find a
“dumb college” for her none-too-bright son.)

A section of the book that is of particular interest to the father is 
the cost of sending a child to college. It explores the financial as- 
pects by looking at such factors as scholarships, borrowing money, 
federal loans, and working your way through college. The issues 
that are raised are reviewed in a realistic manner. 

This book serves as an excellent introduction to college planning 
for parents. It does not attempt to provide a comprehensive coverage 
of the topic. It does review some of the basic factors of which par- 
ents should be aware when they are contemplating college for their 
children. It would make an excellent study for a group guidance 
class for parents.

Henry KACZOWSKI
University of Illinois 


Approaches to Psychopathology by James D. Page (Ed.). New
York: Temple University Publication (Columbia University
Press), 1966. Pp. xii + 304. $7.50.

In spite of its title, this book does not review the various theoreti- 
cal positions on the remediation of psychopathology. It is a collec- 
tion of 18 essays that point up the relationship between psycho- 
pathology and other disciplines. Although it is generally agreed that 
an interdisciplinary approach is needed to resolve the problems of 
psychopathology, an integrated attack on the matter has not ma- 
terialized to date. Two factors have precluded an interdisciplinary 
attack. First, adequate communication between disciplines is often 
blocked because of the jargon used to express ideas. The second 
factor is that the blend of esoteric interests and skills that structure
the work in a given discipline prevents investigators from looking
at the total situation. The general purpose of the book is to help
ond section of the essay reflects topics of personal interest to the
author. This includes interpretations and conclusions about the
theoretical conceptualizations of psychopathology. This approach
permits the reader not only to gain an overview of an area of study 
but also to savor some of the deviations from the accepted modes of 
thinking. Instead of being exposed to a dry recital concerning the 
rubrics and rituals, the reader's attention is captured by controversy.
This tactic forces the reader to read each paper with care and
thereby to increase the chances of gaining information about a 
topic. 

Eleven of the thirteen essays were presented as part of a lecture 
series for graduate students in psychology at Temple University.
Papers from the following fields of study are included: anthropol- 
ogy, biology, clinical psychology, existentialism, genetics, law, lit- 
erature, neurology, philosophy of science, psychoanalysis, psycho- 
linguistics, and sociology. The criteria for selection of the papers for 
the lecture series were not given. 

Although the essays tend to be of uniform quality, the reviewer
found the papers on the philosophy of science, psychological testing,
and neurology to be of special interest. The last-named essay has an
excellent discussion of the neurological approach to anxiety. The
discussion on the differences between a rational approach to life and 
a logical approach to life presented in the paper on the philosophy 
of science can readily be incorporated in the treatment process of 
psychopathology. 

This collection of essays points up the need to view the problem 
of psychopathology from a multidimensional perspective. It further 
suggests that a periodic dissemination of a collection of interdis- 
ciplinary views would enhance the work of those engaged in the 
treatment of psychopathology. It is hoped that the additional col- 
lection of papers would be of the same quality as the present 
volume. 

Henry KACZOWSKI 
University of Illinois 


The Psychology of Human Behavior by Richard A. Kalish. Bel- 
mont, California: Wadsworth Publishing Company, 1966. Pp. 
viii + 529. 

This book is aimed at serving the needs of an introductory course 
in Psychology directed to students not intending to major in the 
field. The author believes that these students need a textbook that 
will minimize the use of psychological jargon, stress concepts, present 
the most important terminology, integrate research and nonresearch 
writings, deliver a point of view rather than follow an eclectic approach,
and “develop an understanding of the human personality as
a possible aid in their future vocational and family roles” (p. v).
The author has produced a book with emphasis on human problems,
full of relatively simple examples of everyday situations and guide- 
lines to understand all of the factors involved in them. 

The book contents include the following sections: (I) Introduc- 
tion to the Basic Principles of Psychology; (II) Development of 
Human Behavior; (III) The Development and Effects of Stress; 
(IV) Man and his Society; (V) The Healthy Personality; and 
(VI) two appendixes concerned with (i) College orientation, and 
(ii) Study methods. 


in the text. Cartoons are also used as illustrations lending a tone of 
informality to the chapters and delivering an impression of non- 
ing, which is presented in two and one-half pages, is rarely mentioned
in the remainder of the book; the rest of the chapter on learning
is devoted to paragraphs on problem solving, insight, concept
formation, thinking, language and thought, creative thinking, and 
massed vs. distributed practice. The chapter ends with a number of 
suggestions on how to study. In twenty pages, the author goes from 
simple conditioning to complex problem solving. 

The next chapter of the first section examines the topic of personality
development. The author centers the chapter on the self-concept
and gives several examples of how knowledge of the self-concept
is essential for the understanding of personality, e.g., “Many
forms of behavior... make sense if you can learn about the
self-concept of the people involved” (p. 94). Traits and roles are
discussed, and the chapter ends with a number of paragraphs on the
neonate, environment and heredity, and the concept of maturation.

The chapters in the second section include the earliest years, the 
developing child and his relationships, personality from twelve to 
twenty-one, the student at college, courtship and marriage, and the 
mature years. The general orientation in this section is that of re- 
garding the individual as progressively developing a self-concept 
molded by the nature of his interpersonal relations, mainly those in 
the home and the immediate surroundings. Special emphasis is given 
to significant figures associated with warmth, love, and affection. 
The child’s ability to learn about the world around him depends 
upon many factors including intelligence determined in turn by 
the home, the family atmosphere, language ability, and motivation. 
Personality development is contemplated as self-actualization. The 
author stresses opportunities for self-expression, exploration of the 
environment, encouragement of initiative and expressive play ac- 
tivities. Of the several factors affecting self-actualization, the au- 
thor believes that “healthy parent-child relationships are undoubt- 
edly the major factor in . . . self-actualization in later years" (p. 
153). 

The chapter on adolescence is well illustrated with photographs
depicting young people at school, at play, and at home. Eight pages
of this chapter are devoted to dating, the author presenting short
“case-history” examples of how Jack Soph can be more
successful in dating Mary Frosh.

A whole chapter is devoted to discussing the student at college 
regarding the college experience as an opportunity for expanded de- 
velopment of the self-concept, establishing new freedoms, a new 
identity, new relationships, and new responsibilities. 

Another chapter is concerned with courtship and marriage, the 
general thesis being that the more similar marriage partners are, 
the more likely it is for that marriage to be successful. References 
supporting this view are given throughout the chapter. 
The section ends with a chapter on the mature years; special attention
is given to the problem of the role of women in our society
and to the problems of old age.

The third section of the book deals with stress and its effects 
upon human behavior. The three chapters of this section include 
emotions and stress, reactions to stress, and the troubled personal- 
ity. The chapter on emotions and stress is a general and very brief 
review of what is meant by fear, anxiety, frustration, conflict, love 
and affection; a general treatment is given to the types of stress in 
American society such as demands from authority figures, sex-role 
demands, financial pressures, and death and bereavement. No more 
than five references are given in this chapter. 

The following chapter discusses reactions to stress, emphasizing
defense mechanisms; the author also reflects on how stress can promote
increased motivation, new insights, and new levels of aspiration.

The next chapter of the section devotes twenty-five pages to the
presentation of personality disorders, emphasizing the differences
between neuroses and psychoses. The main types of neuroses and
psychoses are discussed in short paragraphs. The chapter also includes
a short treatment of alcoholism, drug addiction, and psychotherapy.
In view of this general and superficial treatment of psychopathology,
it is obvious that the author is not interested in
reaching readers with a clinical interest.

The rest of the book is obviously directed toward meeting vocational,
social, and academic needs of the college student. One hundred
and fifty-eight pages are devoted to a discussion of career
It is definitely not methodologically and research oriented, and it
delivers a practical, adjustment type of psychology.

On the side of criticism, the book is open to a wide variety of ob- 
jections. One of the first shortcomings of the text is found in the 
chapter on learning. No mention is made of any of the significant 
learning theories of the last thirty years, and the names of Hull, 
Guthrie, Tolman, and Skinner, for instance, are not found at all. 
The treatment of the basic mechanisms of learning is superficial 
and limited. The reader could not derive from this chapter an ade- 
quate picture of contemporary views on learning. In his treatment 
of motivation and personality, the author uses a hierarchy of needs 
as a set of explanatory concepts. The reader may possibly reach the 
conclusion that a person seeks affection out of a need for love or 
that he behaves in such a fashion because of his self-concept. This 
oversimplification is always a danger in a book aimed at an intro- 
ductory course, since the image may be projected of psychology as 
not basically concerned with empirically defined terms and with
the establishment of functional relationships among operationally 
defined variables. 

The book gives answers in a rather definitive tone more than it 
poses questions. The emphasis is neither behavioristic nor clinical, 
which makes it an unlikely choice for most teachers who follow 
either of these orientations. There are too many broad generaliza- 
tions throughout the chapters, and psychology is presented more 
as an artful use of existing answers than as a science dealing with 
controversial issues. The text is informative rather than formative; 
no methodology is outlined and no theoretical issues are presented. 
The prospective user of this book should ask himself whether stu- 
dents of introductory psychology not majoring in the field should be 
given a number of practically oriented notions to help them in daily 
life or be confronted with psychology as a science in the making, 
full of methodological questions, theoretical issues, and contradic- 
tory results. After all, one can always argue that presenting psy- 
chology in this controversial fashion may challenge students enough 
that they may decide to major in this field. 

GUIDO BARRIENTOS
Texas Western College 


Improving Teaching: The Analysis of Classroom Verbal Interaction
by Edmund Amidon and Elizabeth Hunter. New York:
Holt, Rinehart and Winston Inc., 1966. Pp. 221.

Edmund Amidon and Elizabeth Hunter have provided the pro- 
fession with a rationale, a model, and the tools for modifying 
teacher-pupil interaction, in a way that frees those involved in the 
improvement of teaching relationships from hobbling habit and yet 
allows them to retain and create those teaching behaviors which are
effective.

The authors acknowledge that the book is an adaptation of Ned
Flanders’ Verbal Interaction Category System (VICS). The categorization
of the pupil-teacher interaction (VICS), under the heading
of teacher-initiated talk, includes the subheadings (1) gives information
and opinion, (2) gives direction, (3) asks narrow question, and
(4) asks broad question; under teacher response are listed (5) accepts
ideas, behavior, or feeling and (6) rejects ideas, behavior, and
feeling; under pupil response are found (7) responds to teacher
predictably or unpredictably and (8) responds to another pupil; under
pupil-initiated talk are given (9) initiates talk to teacher and (10)
initiates talk to pupil; and finally, under other are (11) silence and
(12) confusion.

The form of the book, after explicit definitions for each of the 
twelve categories noted above are given, becomes a kind of work- 
book for the analysis of classroom verbal interaction. Once a stu- 
dent-teacher situation is described, the quality of interaction is as- 
sessed by using VICS. Finally, a skill session is provided whereby 
teachers can make suggestions for altering behavior in the situation 
presented, ultimately generalizing to their verbal interaction abili- 
ties. 

Situations cited in the book take place at a variety of grade 
levels in a broad array of instructional occurrences. The sophistica- 
tion of the authors is apparent by the tight organization of the book 
and the terseness and precision of the concepts presented. In the 
analysis and skills sessions following each teaching example, the 
levels of conversation, the quality and kinds of teacher questions 
(cognitive memory, convergent, divergent, broad or narrow?), and 
the verbal patterns are expanded in such a way as to accommodate 
psycholinguistic theory. 

The ways in which students are asked to react and develop skills 
suggest refreshingly varied activity for teacher implementation 
courses—e.g., roleplaying, “remove the rejection from the following 
statements,” discuss the teacher’s assumptions, which tend to en- 
courage the rethinking of unacknowledged psychological behaviors. 
Thus the teacher is led via rational self-control to constructs of
enlightened, effective, and less punitive classroom relationships.

CAROLYN M. NEAL
University of California, 
Santa Barbara 


Reading Difficulties: Their Diagnosis and Correction by Guy L.
Bond and Miles A. Tinker. New York: Appleton-Century-Crofts,
1967. Pp. viii + 564.


In the revision of their earlier edition of Reading Difficulties: 
Their Diagnosis and Correction, Bond and Tinker have produced
one of the most exhaustively complete texts for the teacher and 
clinician in the area of remedial reading. The systematic attack on 
reading problems begins with a description of the nature and mani- 
fold causes of reading disability, which is followed by a section of 
specific, analytic, diagnostic techniques to be used with the problem 
reader—a section ultimately developing sequential materials for re- 
medial treatment. 

Evaluation and the directions for use of the various standardized 
tests in reading, informal procedures of diagnosis, and individual- 
ized tests of reading ability are well covered in the section dealing 
with diagnosis of reading problems. Newer methods and systems 
(e.g., ITA, linguistics, and the Doman-Delacato neurological approach
to reading instruction) are considered carefully. The text
accomplishes the task, one that is difficult in the developing and
expanding field of reading, of being “au courant” without being
faddish.

Complete and contemporary as the book is, the authors have not
avoided a rambling, somewhat jargon-laden, and
dated writing style. The audience for which the edition is intended
is the practitioner rather than the theorist; thus the writers appeal 
to a group entrenched in the use of "educationese," but not to a 
profession which should be moving away from it. 

For the instructor-practitioner in special classes for the remedial 
reader or in the clinical situation, this well documented guide is 
among the most definitive, comprehensive, and useful. Diagnosis 
becomes facile, and specific treatment of idiosyncratic reading dif- 
ficulty is easily located using systematized procedures. Indeed, 
without over-simplifying, the authors seem to reduce the grossness 
and complexity of the diagnosis and treatment of reading disability
to a realistically manageable area for skill instruction.

CAROLYN NEAL 
University of California, 
Santa Barbara


ERRATUM 

In the article “A FORTRAN Program for Guttman and Other
Scalogram Analyses” by Roland Werner, which appeared on page
203 of the Spring 1967 issue of this journal, the Program was listed
in error as being available from the Systems Research Committee,
216 Ostrom Avenue, Syracuse, New York. The Program is available
from the Vogelback Computing Center, Northwestern University,
Evanston, Illinois 60201 as GUTTSCAL N.U.C. 0115.
One of the values of Hierarchical Classification by Reciprocal
Pairs and similar methods (McQuitty, 1966), is their ability to 
classify data without superimposing or forcing the kind of hier- 
archical classification which will emerge. Another is their ability to 
classify almost any set of data without encountering circumstances 
that necessitate arbitrary decisions. 

There is, however, one special condition which does require an 
improvement in order to avoid an arbitrary decision which might 
affect the resulting hierarchical structure. This condition pertains 
to one type of tie in the degree of association between objects to be 
classified into a hierarchical structure. The effects of an arbitrary 
resolution of the tie can be extensive in certain unusual cases, i.e., 
when the tie is for the highest entry in the matrix, and for certain 
other pairs, when the mechanics of the solution provide that two or 
more pairs can be classified simultaneously. 


A Hypothetical Problem 


Suppose that the highest score in a matrix is represented by two
indices. Suppose that one of these indices mediates between
individuals I and J and the other between I and K. Then, I is
highest with J and J, in turn, is tied for highest with I. The
problem arises, as illustrated in Table 1, because I is also highest
with K and K is also tied for highest with I. (On the other hand,
if the index for individuals IJ is tied with that for any two other
individuals, neither of whom is I nor J, there is no problem, as
illustrated in Table 2. In this event, the reciprocal pair selected
first is immaterial.)


TABLE 1
A Hypothetical Illustration of a Problem with Ties in a Pattern Analysis

        i     j     k
  i          70    70
  j    70
  k    70

Note. Each entry shown is the highest entry in its column.
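
The two kinds of tie can be told apart mechanically. The following is a minimal sketch in Python, not the author's program: the function names are hypothetical, the J-K value of 40 and the filler values of 30 are assumed for illustration, and only the tied values of 70 come from the tables. It lists every reciprocal pair in a symmetric matrix and then flags reciprocal pairs that share an individual, which is the situation of Table 1 but not of Table 2.

```python
from itertools import combinations

def symmetric(entries):
    """Build a symmetric association table from {(a, b): value} entries."""
    assoc = {}
    for (a, b), value in entries.items():
        assoc.setdefault(a, {})[b] = value
        assoc.setdefault(b, {})[a] = value
    return assoc

def reciprocal_pairs(assoc):
    """Pairs in which each member is highest (or tied for highest) with the other."""
    labels = sorted(assoc)
    best = {a: max(assoc[a].values()) for a in labels}
    return [(a, b) for a, b in combinations(labels, 2)
            if assoc[a][b] == best[a] == best[b]]

def conflicting_ties(pairs):
    """Reciprocal pairs that share an individual, so both cannot be formed first."""
    return [(p, q) for p, q in combinations(pairs, 2) if set(p) & set(q)]

# Table 1 pattern: I is tied at 70 with both J and K (the J-K value is assumed).
table1 = symmetric({("I", "J"): 70, ("I", "K"): 70, ("J", "K"): 40})
# Table 2 pattern: the I-J index ties with the K-M index (the 30s are assumed).
table2 = symmetric({("I", "J"): 70, ("K", "M"): 70, ("I", "K"): 30,
                    ("I", "M"): 30, ("J", "K"): 30, ("J", "M"): 30})

pairs1 = reciprocal_pairs(table1)
pairs2 = reciprocal_pairs(table2)
print(pairs1, conflicting_ties(pairs1))  # the two pairs share I: a problem tie
print(pairs2, conflicting_ties(pairs2))  # disjoint pairs: no conflict
```

In the second case the two pairs have no member in common, so both can be classified at once and the order of selection does not matter.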


A Limited Solution 


When this condition occurs, with which of the two individuals,
J or K, should I be classified? In many sets of data, very little
difference results from arbitrarily classifying I with K rather than
with J; if I is classified first with J, then K next joins to form the
triad IJK; and if I is classified first with K, then J next joins to
form the same triad, IJK. The only difference is at the initial level.
In the one solution, we generate the triad from Cluster IJ and
Individual K and in the other we generate it from the Cluster IK
and Individual J. No subsequent clusters are influenced by whether
we start with one or the other pair-level cluster.


An Exception to the Solution 


The triad IJK is obtained from Pair IJ and Individual K if, and 
only if, Pair IJ is highest with Individual K and Individual K, 
in turn, is highest with Pair IJ (IJ ↔ K). Analogously, the triad 
IJK is obtained from Pair IK and Individual J if, and only if, 
IK ↔ J. 


TABLE 2 
A Hypothetical Illustration of a Tie Which Does Not Cause a Problem 


i 
k 70 
т 


70 
—Highest entry in the Column 
In order to satisfy both of these two conditions, the association 
between J and K must be higher than the association of any other 
individual with either Pair IJ or Pair IK (a necessary but insuffi- 
cient condition). This requirement is true because the index of 
association between IJ and K (and IK and J) can be no higher 
than the index between J and K. 

Suppose that IJ is selected first. Suppose also that JK is rela- 
tively low compared with the associations between IJ and the other 
elements of the matrix. Under such circumstances, the Cluster IJK 
might never be realized, and Individual K might never enter into a 
cluster with Individual I, even though these two were tied for the 
highest entry in the matrix. Analogously, if the cluster selected 
initially is IK, then Individual I might never enter into a cluster 
with Individual J, even though they yield one of the two highest 
entries in the matrix. In summary, under these conditions, the 
clusters selected subsequent to the initial cluster vary depending 
upon whether IJ or IK is used as the starting point. 


A Role for Assumptions 

The assumptions used in developing Hierarchical Analysis by 
Reciprocal Pairs suggest a solution which is applicable to all ties 
in reciprocal pairs. 

According to the assumptions, every individual is an imperfect 
representative of a pure type. A better representation is obtained 
by classifying an individual with the one other individual who is 
most like him and then determining what is common to the two 
individuals (McQuitty, 1966). 


A General Solution 


Following these assumptions, a logical solution to the above 
problem of ties is to classify I with J to yield Cluster IJ, and to 
classify I with K to yield Cluster IK. In this approach, we assume 
that Pair IJ is a better representative of a type than is either I or 
J separately; and, analogously, we assume that Pair IK is a better 
representative of a type than is either I or K separately. 

The nature of the data will determine whether or not K and J 
eventually come together. If they never come together, then I helps 
describe two different clusters at every level of classification. The 
characteristics used to describe one type are not identical with 
those used to describe the other type. Individual I is in fact a 
member of two different types, each having different combinations 
of characteristics. On the other hand, if any two or more clusters 
do portray identical patterns of characteristics, then the members 
of the two clusters must be combined into one cluster; otherwise, 
they will appear as persistent ties and a solution will not be 
reached. 

Both ties and identical patterns can occur at any level in a 
hierarchical analysis and they can always be handled as described. 


Illustration 


That the general solution is required is illustrated several times 
by a particular set of data. 


The Data 


The data are those gathered in a pilot study designed to help 
plan a sophisticated investigation. In the present paper, we are 
concerned primarily with the technical problem of ties rather than 
a substantive problem. The data are described in relation to this 
technical problem. 

The data were generated by requesting a subject to react to the 
pictures of 20 art objects, by using adjectives which might describe 
them (40 adjectives were used). For each art object the subject 
went through the entire list of adjectives before proceeding to the 
next object. The subject responded by saying, in effect, that the 
adjective is descriptive of the object; that it is not descriptive; or 
that she could not decide whether or not it is descriptive. If the 
subject’s initial response to a picture was positive, she then en- 
dorsed one of three alternative answers: (1) “I like the character- 
istic described by this adjective,” (2) “I do not like it,” (3) “I 
can’t decide whether or not I like it.” 


The Analysis 


The analysis was designed to classify the objects into clusters 
according to the degree of similarity in the reactions to them. 

An agreement score (Zubin, 1938) was computed for every ob- 
ject with every other object. Suppose that there were six adjectives 


and two objects and that the subject reported the following reactions: 

Adjectives:  1          2             3       4          5   6 
Object A     Yes, Like  Yes, Dislike  Yes, ?  No         No  Yes, Dislike 
Object B     Yes, Like  ?             Yes, ?  Yes, Like  No  Yes, Dislike 


The agreement score between A and B for these six adjectives 
would be four, the agreement being on Items 1, 3, 5, and 6 only. 
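
The counting rule in this example can be sketched in a few lines of Python. This is only an illustration of the agreement score as described here: the function name `agreement_score` and the string coding of the reactions are assumptions made for the sketch, not part of the original study.

```python
def agreement_score(reactions_a, reactions_b):
    """Count the adjectives on which two objects drew identical reactions."""
    if len(reactions_a) != len(reactions_b):
        raise ValueError("both objects must be rated on the same adjectives")
    return sum(a == b for a, b in zip(reactions_a, reactions_b))


# Reactions to the six adjectives in the example above (string coding assumed).
object_a = ["Yes, Like", "Yes, Dislike", "Yes, ?", "No", "No", "Yes, Dislike"]
object_b = ["Yes, Like", "?", "Yes, ?", "Yes, Like", "No", "Yes, Dislike"]

score = agreement_score(object_a, object_b)
print(score)  # 4: agreement on Items 1, 3, 5, and 6
```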

Similar computations across all 40 adjectives and among all 20 
objects yielded the 20 x 20 matrix of agreement scores shown in 
Table 3. 

The highest entries in each column of the matrix are underlined. 
The highest entry in the matrix is 34, and reciprocal pairs CF and 
CP each have this value. There are no other tied reciprocal pairs 
at this, the first level of analysis. However, as higher level matrices 
were formed, numerous other ties occurred. 

The data were analyzed three times, using the method of Hierarchical 
Classification by Reciprocal Pairs (McQuitty, 1966). Each 
time the tied reciprocal pairs were treated differently. In the first 
analysis, where C was highest with both F and P, C was classified 
with F and CP was ignored. In the second analysis, the data were 
ordered so that C was classified with P and CF was ignored. In the 
third analysis, C was classified with both F and P to produce 
clusters CF and CP. 

In the first two analyses, whenever a tie occurred in the higher 
level matrices during analysis, the first expression of the tied value 
(reading the matrix from left to right by columns and from top to 
bottom within columns) was arbitrarily selected. In the third anal- 
ysis, every reciprocal pair in a tie was used separately to yield a 
cluster. 
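
The rule used in the third analysis, in which every tied reciprocal pair yields its own cluster, can be sketched as follows. This is a minimal illustration under stated assumptions and not the program used in the study: the function name `reciprocal_pairs`, the five labels, and the matrix values are hypothetical, chosen so that I is tied for highest with both J and K.

```python
def reciprocal_pairs(labels, sim):
    """Return every reciprocal pair in a symmetric matrix, keeping all ties.

    A pair (i, j) is reciprocal when j is among the highest entries for i
    and i is among the highest entries for j (the diagonal is ignored).
    """
    n = len(labels)
    best = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        top = max(sim[i][j] for j in others)
        best.append({j for j in others if sim[i][j] == top})

    pairs = []
    for i in range(n):
        for j in range(i + 1, n):
            if j in best[i] and i in best[j]:
                pairs.append((labels[i], labels[j], sim[i][j]))
    return pairs


# Hypothetical agreement scores: I-J and I-K are tied for the highest entry.
labels = ["I", "J", "K", "L", "M"]
sim = [
    [0, 70, 70, 20, 10],
    [70, 0, 30, 25, 15],
    [70, 30, 0, 20, 10],
    [20, 25, 20, 0, 60],
    [10, 15, 10, 60, 0],
]

for pair in reciprocal_pairs(labels, sim):
    print(pair)
```

With these values the sketch reports both IJ and IK, along with the untied pair LM, instead of arbitrarily keeping only the first tied pair it encounters.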


Results 


The portions of the resulting hierarchical classifications illustrat- 
ing the greatest discrepancies among the three analyses were se- 
lected for study. Figures 1, 2, and 3, representing the first through 
the third analyses, respectively, show the classifications of eight of 
the twenty objects. 

The discrepancies among the classifications for the eight objects 
used in this illustration are summarized in Table 4, which shows 
all of the clusters derived from the three analyses and the specific 
analysis in which each cluster appeared. Assuming that the third 

[Figure 1 graphic not reproduced; axis labels: Agreement Scores, Objects.] 

Figure 1. A Hierarchical Classification of Objects to Illustrate a Problem 
with Ties (of the tied values, CF and CP, CF was arbitrarily chosen). Numbers 
within the structure are agreement scores reporting the degree of association 
between the types as the analysis proceeds. 


[Figure 2 graphic not reproduced; axis labels: Agreement Scores, Objects.] 

Figure 2. A Hierarchical Classification of Objects to Illustrate a Problem 
with Ties (of the tied values, CF and CP, CP was arbitrarily chosen). Numbers 
within the structure are agreement scores reporting the degree of association 
between the types as the analysis proceeds. 
[Figure 3 graphic not reproduced; axis label: Agreement Scores.] 

Figure 3. A Hierarchical Classification of Objects to Illustrate a Solution 
to Problems of Ties (wherein C is classified with F and with P because it is 
highest with each of them). Numbers within the structure are agreement 
scores reporting the degree of association between the types as the analysis 
proceeds. 


TABLE 4 

A Listing of Clusters Derived from all Three Hierarchical Analyses 
and Involving the Eight Selected Objects 

Clusters from all Three Analyses     Analysis No. 1     No. 2     No. 3 

[Table entries are not recoverable from the source.] 

analysis is valid, then Analysis No. 1 was defective because it 
yielded only 3 of the 11 correct clusters. Four of its 7 clusters were 
incorrect. Analysis No. 2 was similarly defective. Furthermore, 
Analyses 1 and 2 have only one cluster in common, the final one. 


Substantive Interpretation 


Although this paper is concerned primarily with a technique for 
analyzing data, the reader might have some interest in a sub- 
stantive description. The matrix of this study was generated by 
one person. The interassociations between objects were generated 
by her reactions exclusively. In isolating the hierarchical structure 
we are in fact analyzing a mental structure of a single individual. 
Another example of a similar approach (in the field of Political 
Science) is an analysis of Supreme Court decisions made by Justice 
Robert H. Jackson (Schubert, 1965). 

The description of a cluster might be interesting. For substantive 
description, we have chosen Cluster MICFP, the largest cluster 
obtained in Analysis No. 3 (and not in either Analyses 1 or 2). 

The objects of this study were sterling silver spoons of various 
designs. The subject gave the following reactions to all five spoons 
of Cluster MICFP: 


Tasteful Yes, Like Segmented No 
Overdone No Busy No 
Characterless No Dark No 
Abrupt No Consistent Yes, Like 
Flared No Redundant No 
Droopy No Plain Yes, Like 
Feminine Yes, Like Swirling No 
Subtle Yes, Like Lumpy No 
Lasting Yes, Like Aggressive No 
Asymmetrical No Unfriendly No 
Rambling No Restrained Yes, Like 
The above analytical description can be composed into the follow- 
ing prose: 


These five spoons were liked for the physical characteristics of 
plainness of design. Also, they were thought to look restrained, 
tasteful, consistent, subtle, lasting, and feminine, all of which appear 
to be psychologically or sociologically determined characteristics. 
It seems reasonable to regard this assembly of positive reactions 
as a consistent cognitive organization. In addition, the 
“No” reactions to the other adjectives in this cluster do not include 
any reactions that seem inconsistent with the positive ones. In 
fact, these “No” reactions can help to clarify the meaning of some 
of the positive reactions. For example, the spoons look restrained, 
but not characterless; tasteful, but not overdone; plain, but not 
asymmetrical, segmented, flared, rambling, swirling, or lumpy. 

Although it appears that this subject liked these five compara- 
tively plain spoons, little else can be concluded without an analysis 
of the differences among the various clusters. 


Summary 


This paper develops a generally applicable solution for ties 
among reciprocal pairs in pattern analytic methods. The solution 
is illustrated in the analysis of art objects using the reactions to 
them by a single subject; she reacted to each of 20 objects, each 
evaluated 40 times according to a set of 40 adjectives. Agreement 
scores between the objects were computed over the 40 adjectives 
and then pattern analyzed. The grouping derived from the analyti- 
cal solution to the problem of ties appeared to be meaningful. 
A NOTE ON DESIGN OF TEST EXPERIMENTS 


H. G. OSBURN 
University of Houston 


The theory of generalizability distinguishes between experiments 
in which items are matched and experiments in which items are 
unmatched (Cronbach, Rajaratnam and Gleser, 1963). These 
writers have shown that the unmatched item design leads to differ- 
ent reliability formulas for estimating the generalizability of a 
test. This paper shows that the unmatched item experiment has 
certain advantages over the usual matched item experiment de- 
pending upon the objectives of the study. 

Items are said to be matched if for every item in the test there 
is an observation $x_{pi}$ for every person in the sample of examinees. Items are 
said to be unmatched when the items are selected randomly and 
independently for each person. 

Most studies using psychological tests have used the matched 
item design, and it might seem that unmatched item designs would 
be something to avoid. This is definitely not the case. It turns out 
that unmatched item designs have important applications in test 
experiments as will be shown below. 

Suppose that we are sampling from a population of persons, P, 
and a population of items, J, and we wish to generalize over both 
populations. Consider the following experiments. 


(1) K items are randomly selected from population J, and administered 
    to one individual from population P. We wish to estimate 
    $\zeta_p$, where $\zeta_p = E_i(z_p)$, the expectation of $z_p$ over 
    all samples of K items, and $z_p = \sum_i x_{pi}/K$. 

(2) K items, randomly selected from population J, are administered 
    to examinees from population P. We wish to estimate $(\zeta_p - \zeta_q)$ 
    for every pair of examinees p and q. 

(3) K items, randomly selected from population J, are administered 
    to each person in a sample of N examinees randomly selected 
    from population P. We wish to estimate $\zeta$, where 
    $\zeta = E_{pi}(x_{pi}) = E_p(\zeta_p)$. 

(4) K items, randomly selected from population J, are administered 
    to each person in a sample randomly selected from population 
    $P_1$, and to each person in a sample randomly selected 
    from population $P_2$. We wish to estimate $(\zeta_1 - \zeta_2)$. 

These four experiments are representative of inferences that we 
would like to make from test data. In each experiment our objective 
is to estimate a population parameter. The question asked in this 
paper is which experimental design provides the most efficient esti- 
mate of the parameter—a matched item (MI) design where each 
examinee takes the same set of K items or the unmatched item 
(UMI) design where the K items are selected independently for 
each person. 

Case 1. In the situation where one wishes to estimate $\zeta_p$ for a 
single examinee it doesn’t matter whether the MI or the UMI 
design is used. In either case the squared standard error of $z_p$ is 
given by 

$$\mathrm{VAR}_i(z_p) = \zeta_p(1 - \zeta_p)/K. \tag{1}$$

Case 2. When we are interested in difference scores it can be 
shown that the MI design is more efficient since 

$$\mathrm{VAR}_i(z_p - z_q) = \mathrm{VAR}_i(z_p) + \mathrm{VAR}_i(z_q) - 2\,\mathrm{COV}_i(z_p, z_q). \tag{2}$$

When UMI data are collected $\mathrm{COV}_i(z_p, z_q) = 0$, but for MI data 
$\mathrm{COV}_i(z_p, z_q)$ is generally greater than zero. 

Case 3. In the situation where one is interested in estimating the 
mean true score over a population of examinees it can be shown that 
the UMI design is more efficient. Lord (1955, equation 52) has shown 
that the sampling variance of $\bar{z}$ for the situation in which items and 
persons are sampled simultaneously can be expressed by 

$$\mathrm{VAR}_{pi}(\bar{z}) = E_p[\mathrm{VAR}_{i \cdot p}(\bar{z})] + \mathrm{VAR}_p[E_{i \cdot p}(\bar{z})] \tag{3}$$

where $\mathrm{VAR}_{i \cdot p}(\bar{z})$ is the sampling variance of $\bar{z}$ over samples of 
items for a fixed sample of examinees, and $E_{i \cdot p}(\bar{z})$ is the expected 
value of $\bar{z}$ over all samples of K items for a fixed sample of examinees. 

Working with the last term on the right of equation (3) it is seen 
that $E_{i \cdot p}(\bar{z}) = \bar{Z}$ so that 

$$\mathrm{VAR}_p[E_{i \cdot p}(\bar{z})] = \sigma_\zeta^2/N. \tag{4}$$

Working with the first term on the right in equation (3), since 
$\bar{z} = \sum_p z_p/N$, 

$$\mathrm{VAR}_{i \cdot p}(\bar{z}) = \Big[\sum_p \mathrm{VAR}_i(z_p) + \sum_p \sum_{q \neq p} \mathrm{COV}_i(z_p, z_q)\Big]/N^2. \tag{5}$$

When the items are unmatched so that K items are sampled independently 
for each person $\mathrm{COV}_i(z_p, z_q) = 0$, and equation (5) reduces to 

$$\mathrm{VAR}_{i \cdot p}(\bar{z}) = \sum_p \mathrm{VAR}_i(z_p)/N^2 \tag{6}$$

and since $\mathrm{VAR}_i(z_p) = \zeta_p(1 - \zeta_p)/K$, 

$$\mathrm{VAR}_{i \cdot p}(\bar{z}) = (\bar{Z} - S^2 - \bar{Z}^2)/NK \tag{7}$$

where $S^2 = \sum_p (\zeta_p - \bar{Z})^2/N$, and $\bar{Z} = \sum_p \zeta_p/N$. Taking the 
expected value of equation (7) over all samples of N persons, we 
obtain 

$$E_p[\mathrm{VAR}_{i \cdot p}(\bar{z})] = [\zeta - (N - 1)\sigma_\zeta^2/N - \sigma_\zeta^2/N - \zeta^2]/NK \tag{8}$$

since $\zeta = E_p(\bar{Z})$, and $E_p(\bar{Z}^2) = \sigma_\zeta^2/N + \zeta^2$; thus 

$$E_p[\mathrm{VAR}_{i \cdot p}(\bar{z})] = [\zeta(1 - \zeta) - \sigma_\zeta^2]/NK. \tag{9}$$

Therefore, from equations (3), (4) and (9) it is seen that 

$$\mathrm{VAR}_{pi}(\bar{z}) = [\zeta(1 - \zeta) - \sigma_\zeta^2]/NK + \sigma_\zeta^2/N \tag{10}$$

which reduces to 

$$\mathrm{VAR}_{pi}(\bar{z}) = [\zeta(1 - \zeta) + (K - 1)\sigma_\zeta^2]/NK. \tag{11}$$


Lord (1955, equation 65) has shown that (when the items are 
matched) 

$$\mathrm{VAR}_{pi}(\bar{z}) = [\zeta(1 - \zeta) + (K - 1)\sigma_\zeta^2 + (N - 1)\sigma_\pi^2]/NK \tag{12}$$

where $\pi_i = E_p(p_i)$, $p_i = \sum_p x_{pi}/N$. Therefore, we have the following 
relation 

$$\mathrm{VAR}_{pi}(\bar{z})_{MI} = \mathrm{VAR}_{pi}(\bar{z})_{UMI} + (N - 1)\sigma_\pi^2/NK. \tag{13}$$

Thus, the sampling variance of $\bar{z}$ over samples of persons and items 
is smaller for the UMI design as compared with the MI design 
by the amount $(N - 1)\sigma_\pi^2/NK$. It is apparent that the discrepancy 
between $\mathrm{VAR}_{pi}(\bar{z})_{MI}$ and $\mathrm{VAR}_{pi}(\bar{z})_{UMI}$ decreases as K increases, 
and increases as N increases. $\mathrm{VAR}_{pi}(\bar{z})_{UMI}$ tends toward zero as 
N increases while $\mathrm{VAR}_{pi}(\bar{z})_{MI}$ tends toward $\sigma_\pi^2/K$. The implication 
is that it is possible to obtain a very precise estimate of $\zeta$ by administering 
a few items to a large number of examinees using the 
UMI design. 
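
The behavior of the two sampling variances can be checked numerically. The sketch below simply evaluates equations (11) and (12) and compares their difference with the amount given in equation (13). The function names and the parameter values (mean true score 0.6, true-score variance 0.02, item-difficulty variance 0.03) are hypothetical and chosen only for illustration.

```python
def var_umi(zeta, var_zeta, n_persons, k_items):
    """Equation (11): sampling variance of the mean, unmatched items."""
    return (zeta * (1 - zeta) + (k_items - 1) * var_zeta) / (n_persons * k_items)


def var_mi(zeta, var_zeta, var_pi, n_persons, k_items):
    """Equation (12): sampling variance of the mean, matched items."""
    numerator = (zeta * (1 - zeta)
                 + (k_items - 1) * var_zeta
                 + (n_persons - 1) * var_pi)
    return numerator / (n_persons * k_items)


# Hypothetical population values, for illustration only.
zeta, var_zeta, var_pi = 0.6, 0.02, 0.03

results = []
for n, k in [(50, 10), (500, 10), (500, 40)]:
    umi = var_umi(zeta, var_zeta, n, k)
    mi = var_mi(zeta, var_zeta, var_pi, n, k)
    gap = (n - 1) * var_pi / (n * k)  # the amount in equation (13)
    results.append((n, k, umi, mi, gap))
    print(f"N={n:4d} K={k:3d}  UMI={umi:.6f}  MI={mi:.6f}  MI-UMI={mi - umi:.6f}")
```

Under these assumed values the gap between the designs grows when N goes from 50 to 500 at fixed K and shrinks when K goes from 10 to 40 at fixed N, as the text states.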

Case 4. Here, we are interested in estimating the difference 
between the mean true score of population $P_1$ and the mean true 
score of population $P_2$. It can be shown that for this situation 
$\mathrm{VAR}_{pi}(\bar{z}_1 - \bar{z}_2)_{MI}$ will be larger than $\mathrm{VAR}_{pi}(\bar{z}_1 - \bar{z}_2)_{UMI}$ when 
N is large, although the two sampling variances will be close to 
the same value in most applications. By the formula for the variance 
of a difference we have 

$$\mathrm{VAR}_{pi}(\bar{z}_1 - \bar{z}_2) = \mathrm{VAR}_{pi}(\bar{z}_1) + \mathrm{VAR}_{pi}(\bar{z}_2) - 2\,\mathrm{COV}_{pi}(\bar{z}_1, \bar{z}_2). \tag{14}$$

In the UMI experiment it is clear that $\mathrm{COV}_{pi}(\bar{z}_1, \bar{z}_2) = 0$, so that 

$$\mathrm{VAR}_{pi}(\bar{z}_1 - \bar{z}_2)_{UMI} = [\zeta_1(1 - \zeta_1) + \zeta_2(1 - \zeta_2) + (K - 1)(\sigma_{\zeta_1}^2 + \sigma_{\zeta_2}^2)]/NK. \tag{15}$$

In the MI design $\mathrm{COV}_{pi}(\bar{z}_1, \bar{z}_2)$ is greater than zero. Lord (1965, 
equation 12.1) has shown that (for the MI design) 

$$\mathrm{COV}_{pi}(\bar{z}_1, \bar{z}_2) = \mathrm{COV}(\pi_1, \pi_2)/K \tag{16}$$

where $\mathrm{COV}(\pi_1, \pi_2)$ is the covariance over all items between the 
true item difficulties for one population of examinees and the true 
item difficulties for the other population. Therefore, from equations 
(12), (14), and (16) 

$$\mathrm{VAR}_{pi}(\bar{z}_1 - \bar{z}_2)_{MI} = [\zeta_1(1 - \zeta_1) + \zeta_2(1 - \zeta_2) + (K - 1)(\sigma_{\zeta_1}^2 + \sigma_{\zeta_2}^2) + (N - 1)(\sigma_{\pi_1}^2 + \sigma_{\pi_2}^2) - 2N\,\mathrm{COV}(\pi_1, \pi_2)]/NK. \tag{17}$$

Comparing equations (15) and (17), it is seen that 

$$\mathrm{VAR}_{pi}(\bar{z}_1 - \bar{z}_2)_{MI} = \mathrm{VAR}_{pi}(\bar{z}_1 - \bar{z}_2)_{UMI} + [(N - 1)(\sigma_{\pi_1}^2 + \sigma_{\pi_2}^2) - 2N\,\mathrm{COV}(\pi_1, \pi_2)]/NK. \tag{18}$$

Thus, which design is more efficient in the case of estimating the 
difference between the mean true scores for two populations depends 
upon the value of $\mathrm{COV}(\pi_1, \pi_2)$. 

For the situation where $\sigma_{\pi_1}^2 = \sigma_{\pi_2}^2 = \sigma_\pi^2$, the last term in equation (18) 
can be expressed as 

$$2\sigma_\pi^2[N(1 - \rho_{\pi_1 \pi_2}) - 1]/NK. \tag{19}$$


It is seen that equation (19) is positive provided that $N(1 - \rho_{\pi_1 \pi_2})$ 
is greater than 1. This will be true if N is reasonably large even 
if the correlation between the true item difficulties in the two populations 
is quite high. For example, with $\rho_{\pi_1 \pi_2} = .90$ equation (19) 
will be positive if N > 10, or with $\rho_{\pi_1 \pi_2} = .99$, the term will be 
positive if N > 100. Thus, it is concluded that with reasonably 
large N the UMI design will usually be more efficient than the 
MI design for estimating the difference between the mean true 
scores for two populations. 
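
The sign condition on equation (19) is easy to explore numerically. The sketch below evaluates that term for a few sample sizes, assuming equal item-difficulty variances in the two populations as in the text. The function name `mi_minus_umi` and the values of the variance and of K are hypothetical.

```python
def mi_minus_umi(var_pi, rho, n_persons, k_items):
    """Equation (19): MI minus UMI sampling variance of the mean difference,
    assuming the same item-difficulty variance in both populations."""
    return 2 * var_pi * (n_persons * (1 - rho) - 1) / (n_persons * k_items)


# Hypothetical item-difficulty variance and test length.
var_pi, k_items = 0.03, 20

for rho, n in [(0.90, 5), (0.90, 50), (0.99, 50), (0.99, 500)]:
    term = mi_minus_umi(var_pi, rho, n, k_items)
    favored = "UMI" if term > 0 else "MI"
    print(f"rho={rho:.2f} N={n:3d}  term={term:+.6f}  more efficient: {favored}")
```

A positive term means the MI design has the larger sampling variance, so the UMI design is the more efficient one for that combination of N and correlation.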

The implication of this paper is that more attention should be 
given to the unmatched item technique in the design of experiments 
involving psychological tests. Of the four cases investigated, the 
MI design was clearly superior only for generalizations regarding 
individual difference scores. This implies that comparisons of treat- 
ment methods, ability surveys, etc., would ordinarily be better 
served by the UMI design. 

There have been three recent papers suggesting the possibilities 
of unmatched studies. Cronbach (1963) has advocated administer- 
ing different questions to different students in discussing the mea- 
surement of proficiency for course improvement and curriculum 
evaluation. Lord (1962) has published an empirical study showing 
the possibilities of collecting unmatched data in estimating a norm 
distribution. Also Lord (1965, Section 13) has recently presented 
results showing that the group mean on a finite pool of items for a 
very large group of examinees will be best estimated by administering 
a different item to each of K nonoverlapping samples of N 
examinees, rather than by administering all K items to a single 
sample of N examinees. 

These results and those presented in this paper point up the 
need for a careful analysis of the objectives of test experiments, 
and for a more general recognition that the fixed test model may 
not be the most efficient model to use in several important psy- 
chometric applications. 
THE INVARIANCE OF CERTAIN ABILITY FACTORS¹ 


MOHAMMED Y. QUERESHI 
Marquette University 


The problem of invariance of factor solutions is especially im- 
portant to the development of theories in those areas of psychology 
where factor analysis is regarded as an appropriate method of 
investigation. Generally in such contexts, invariance is assigned 
one or more of the following meanings: (a) consistency of factors, 
for a fixed set of variables, from one sample of individuals to 
another, (b) consistency of factors, for a fixed sample of indi- 
viduals, from one set of variables to another, (c) consistency of 
factors when both samples and variables are methodically varied, 
and (d) consistency of factors, for a fixed sample of individuals 
and a fixed set of variables, from one method of analysis and/or 
rotation to another. 

While mathematical developments showing the relationship 
among different factorial solutions to the same set of data (see the 
last of the aforementioned meanings) may seem somewhat trivial, 
such solutions for the remaining interpretations of invariance are 
hard to come by. So far satisfactory statistical solutions have been 
offered only for certain special cases under (b) above (Kaiser, 
1958; Meredith, 1964), whereas several investigations (Barlow 
and Burt, 1954; Burt, 1948; Leyden, 1953; Young and House- 
holder, 1940; Zachert and Friedman, 1953) concerning one or more 
of the other meanings of invariance have consistently remained 
empirical in nature. While adequate statistical solutions may become 
available sometime in the future, attempts to provide empirical 
evidence regarding the first three aspects of invariance, at 
least at the present, will have to suffice. 

1 Acknowledgment is due to John Cockayne for assisting with the computational 
work which was done on the IBM 7040 computer located on the 
Campus of Marquette University. An earlier version of this paper was read 
at the Special Scientific Meeting of the Psychometric Society at New York 
University on September 2, 1966. 

The current study was designed to obtain empirical evidence 
concerning two interpretations of invariance, those represented by 
(a) and (b) of the foregoing, with reference to one general and 
three group ability factors extracted in connection with a previous 
investigation (Quereshi, 1967) which demonstrated the existence 
of certain patterns in the cognitive development of children be- 
tween two and ten years of age. 


Method 


The data were based on 700 subjects (Ss) divided into seven age 
groups of 100 each between two years, four months and nine years, 
two months, with equal numbers of children taken from each sex. 
Detailed information about the parametric characteristics of Ss, 
the nature of the test materials, and procedures for testing, etc. is 
available elsewhere (Quereshi, 1964; Quereshi, 1967). The following 
description of the method for treatment and analysis of data 
begins at the point where the previous study (Quereshi, 1967) 
left off. 


Treatment and Analysis of Data 


In the previous study (Quereshi, 1967) one general factor, A, 
and three group factors, B, C, and D, were extracted from each of 
the seven 14 X 14 correlation matrices, representing seven different 
samples (age groups), by means of a modified square root method. 
Since four of the 14 variables in each of those matrices were con- 
stituted from a combination of three or more of the remaining 
ten variables, it became necessary for the purposes of the present 
study to exclude them from any consideration in determining the 
congruence for factors A, B, C, and D across different samples. 
The other ten variables consisted of nine subtests of the Illinois 
Test of Psycholinguistic Abilities (ITPA)² and the Stanford-Binet 
MA, and were utilized in carrying out the necessary analyses for 
attaining the objectives of the present study. 

2 The ITPA consists of the following subtests: Auditory Vocal Automatic 
(AVAu), Motor Encoding (ME), Auditory Vocal Association (AVA), Visual 
Motor Sequencing (VMS), Auditory Vocal Sequencing (AVS), Visual Decoding 
(VD), Visual Motor Association (VMA), Auditory Decoding (AD), and 
Vocal Encoding (VE). 

In order to determine the similarity of the factors across the 
seven different samples (see the first meaning of invariance), co- 
efficients of congruence (Tucker, 1951; Wrigley and Neuhaus, 
1955) were computed for each of the four factors across the seven 
age groups utilizing the aforementioned ten variables. 
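
For readers unfamiliar with the index, the coefficient of congruence between two columns of factor loadings is the sum of their cross-products divided by the square root of the product of their sums of squares. The sketch below shows this computation. The function name and the two loading vectors are hypothetical and are not taken from the present data.

```python
import math


def congruence(loadings_a, loadings_b):
    """Coefficient of congruence between two vectors of factor loadings."""
    if len(loadings_a) != len(loadings_b):
        raise ValueError("both factors must be defined on the same variables")
    cross = sum(a * b for a, b in zip(loadings_a, loadings_b))
    norm = math.sqrt(sum(a * a for a in loadings_a) * sum(b * b for b in loadings_b))
    return cross / norm


# Hypothetical loadings of ten variables on one factor in two age groups.
group_1 = [0.71, 0.64, 0.58, 0.69, 0.55, 0.62, 0.60, 0.66, 0.73, 0.80]
group_2 = [0.68, 0.70, 0.52, 0.65, 0.61, 0.57, 0.63, 0.60, 0.75, 0.78]

print(round(congruence(group_1, group_2), 3))
```

Unlike a Pearson r, the loadings are not centered before the cross-products are taken, which is one reason these coefficients tend to run high when all loadings are positive.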

For the second part of this study (namely, the investigation of 
the stability of factors for the same sample but across different 
sets of variables), efforts were concentrated on only the general 
factor, A, which was constituted in four different ways, each repre- 
senting the same method of forming and extracting factors but 
each consisting of one variable less than the preceding one. Thus, 
the first version of factor A in this second part of the present study 
was identical to factor A in the first part in all respects except for 
the percentages of variance attributable to the former for each of 
the seven samples—a difference chiefly due to the fact that the 
former was extracted from an 11 X 11 matrix instead of a 14 x 14 
matrix. For each of the subsequent three versions, one, two, or three 
of the variables, other than the Stanford-Binet MA, were dropped 
from the matrix, on a random basis, before extracting the corresponding 
version of the general factor, A. Hence, the subsequent 
general factors, A1, A2, and A3, were respectively extracted from 
10 X 10, 9 X 9, and 8 X 8 matrices in which one of the variables 
was the corresponding hypothesized general factor itself, constituted 
and extracted in a manner identical to factor A in the 
previous study (Quereshi, 1967). Accordingly, factors A, A1, A2, and 
A3 consisted of ten, nine, eight, and seven variables respectively. 

For the purpose of estimating similarity among factors A, A1, 
A2, and A3, while holding the sample constant, factor scores for each 
of the 100 Ss comprising a sample were computed for each of the 
four versions of factor A and for all seven samples separately, 
utilizing the commonly available techniques (see Harman, 1960, 
pp. 340-341). The Pearson rs were then calculated among the four 
different versions of the general factor for each of the seven samples 
separately.³ 

3 Although the coefficients of congruence as suggested by Wrigley and 
Neuhaus (1955) were also computed, they are not reported here since not 
much is known about their sampling distribution. These coefficients usually 

It may be mentioned here parenthetically that in the second part 
of this study it was not possible to investigate the stability of 
factors B, C, and D since these factors could not be constituted in 
accordance with the accepted rationale (Quereshi, 1967) whenever 
any of the nine variables (representing the ITPA subtests) were 
excluded from a correlation matrix.* 


Results and Discussion 


The results concerning the first part of this study (i.e., the stability 
of factors A, B, C, and D across seven different samples for 
the same ten variables) are presented in Tables 1 and 2. The coefficients 
of congruence for factor A are presented above the diagonal, 
while those for factor B are given below the diagonal in 
Table 1. Similarly, in Table 2, the data above the diagonal provide 
evidence regarding the cross-sample stability of factor C, whereas 
those below the diagonal have bearing on the generality of factor 
D across the seven given samples. These coefficients for the respective 
factors range as follows: .917 to .980 for A, .896 to .994 
for B, .733 to .980 for C, and .921 to .985 for factor D. The 
medians of these coefficients for factors A, B, C, and D are .960, 
.968, .947, and .957 respectively. Except for six coefficients (out 
of a total of 84) that range between .733 and .792, indicating 
relatively low stability of factor C for just one sample (Sample 7, representing 
the 8½-9 age group), all the remaining 78 coefficients are 
very close to unity. Therefore, it would be justifiable to conclude 
that generally all of these four factors possess a high degree of 
stability across the given samples. Thus, the decision made in the 
previous study (Quereshi, 1967) to interpret these factors as rep- 
resenting approximately the same types of cognitive characteristics 
at different age levels appears, in retrospect, to be justified. 

The second interpretation of invariance as investigated here 


run much higher than the Pearson rs and, in the present study, range between 
.987248 and .999965, with a median of .909509. 

The loadings of the respective constituent variables, on each of the four 
versions of factor A, are reported in Tables A through D which have been 
deposited with the American Documentation Institute, Order Document No. 
9582, remitting $1.25 for 35-mm. microfilm or $1.25 for 6 by 8 in. photocopies. 

*The variables excluded from A1, A2, and A3 respectively are VMS; AVS 
and AD; and AVA, VMS, and VD. Since the variables were excluded randomly 
and with replacement, the same variable(s) may be excluded more than once. 
TABLE 1 

Coefficients of Congruence across Seven Samples for Factor A 
(above the Diagonal) and Factor B (below the Diagonal)* 

                    Samples (Age Groups) 
Samples     1     2     3     4     5     6     7 

   1             979   923   959   970   944   955 
   2       981         959   974   964   950   967 
   3       982   986         976   966   917   939 
   4       984   982   994         980   960   965 
   5       948   940   982   981         962   956 
   6       943   948   977   968   979         957 
   7       896   915   931   908   910   939 

Note.—Decimals are omitted; all coefficients are reported up to three decimals. Samples 1, 2, 3, 
etc. represent age groups 2½-3, 3½-4, etc. up through 8½-9 years. 

*The loadings of factors A and B on the variables involved, for each of the seven samples, have 
been reported elsewhere (Quereshi, 1967). 

deals with the stability of a general factor, based on four different 
sets of variables, for a fixed sample of Ss. The results concerning 
this issue are incorporated in Table 3 which presents the Pearson 
rs among four versions (A, A1, A2, and A3) of the same factor, each 
based on a different set of variables, for each of the seven samples 
(age groups). Also given in Table 3 are the percentages of variance 
across the seven age levels, the Spearman rhos between the hypoth- 
esized and obtained ranks for the percentages of variance, and Ps 
for the rhos for all four versions of the general factor, A. Since the 
main concern here is with the consistency of factor A across differ- 
ent sets of variables, discussion pertaining to percentages of vari- 
ance, etc. is postponed till later. 


TABLE 2

Coefficients of Congruence across Seven Samples for Factor C
(above the Diagonal) and Factor D (below the Diagonal)*

                      Samples (Age Groups)
Samples      1     2     3     4     5     6     7

   1         -    943   915   930   970   935   789
   2        940    -    970   951   962   967   787
   3        932   954    -    947   961   975   749
   4        981   957   969    -    959   980   761
   5        962   926   963   985    -    959   733
   6        955   923   957   983   981    -    792
   7        956   905   921   976   981   968    -

Note. Decimals are omitted; all coefficients are reported up to three decimals. Samples 1, 2, 3,
etc. represent age groups 2½-3, 3½-4, etc. up through 8½-9 years.

*The loadings of factors C and D on the variables involved, for each of the seven samples, have
been reported elsewhere (Quereshi, 1967).
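
The entries of Tables 1 and 2 are coefficients of congruence between the loading patterns of one factor in two samples. As a minimal sketch, assuming the usual Tucker-style form (the sum of cross-products of the loadings divided by the square root of the product of the two sums of squared loadings), the computation may be written as follows. The function name and the two loading vectors are hypothetical and are not taken from the study.

```python
import math


def congruence(x, y):
    """Coefficient of congruence between two vectors of factor loadings."""
    if len(x) != len(y):
        raise ValueError("loading vectors must have the same length")
    cross_products = sum(a * b for a, b in zip(x, y))
    scale = math.sqrt(sum(a * a for a in x) * sum(b * b for b in y))
    return cross_products / scale


# Hypothetical loadings of the same six variables on one factor in two samples.
sample_1 = [0.70, 0.65, 0.60, 0.10, 0.05, 0.00]
sample_2 = [0.68, 0.60, 0.66, 0.15, 0.00, 0.05]

phi = congruence(sample_1, sample_2)
print(round(phi, 3))
```

Unlike a Pearson r, the coefficient does not remove the means of the loadings, so it is sensitive to the overall level of the loadings as well as to their pattern.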
TABLE 3

Correlations among the Four Different Factor Solutions and Percentages of
Variance for the Various Age Groups

                    Pearson Correlation (r)            Percentage of Variance (p)
Age Groups    rab   rac   rad   rbc   rbd   rcd       pa     pb     pc     pd

2½-3         .983  .941  .961  .980  .984  .955     38.49  41.92  42.66  40.35
3½-4         .997  .983  .952  .977  .960  .930     35.03  34.82  39.76  32.92
4½-5         .980  .968  .973  .957  .964  .916     27.83  26.60  27.82  25.94
5½-6         .991  .992  .954  .986  .948  .955     27.86  28.42  29.58  26.91
6½-7         .984  .949  .988  .961  .979  .924     27.98  29.27  27.65  29.25
7½-8         .996  .993  .982  .989  .992  .972     24.10  26.80  28.45  29.09
8½-9         .939  .928  .978  .966  .901  .871     23.43  22.69  27.77  22.06

Spearman's rho                                       .857   .750   .750   .679
P ≤                                                   .02    .05    .05    .06

Note. The symbols a, b, c, and d respectively represent the four versions, A, A1, A2, and A3, of
the general factor. Each r is based on an N of 100. The Spearman rho represents the correlation
between the hypothesized and the obtained ranks for the percentages of variance for the seven
age groups for each of the four factor solutions; P indicates the level of significance for the respective
rho.


An examination of the intercorrelation among the four versions 
of the general factor, for each of the seven samples, indicates that 
the estimates range between .871 and .997, with a median of 
.970. Of all the 42 coefficients only one (.871) is not in the nineties. 
On the basis of this evidence, it can be safely concluded that the 
general factor, as conceived in the present study, remains highly 
stable even when about one-third of the constituent variables are
dropped from the analysis. 

It may be appropriate here to make a few remarks about the 
trend in the relative prominence of factor A and its three other 
versions at the seven age levels. In a previous study (Quereshi, 
1967) it was demonstrated that the general factor, A, gradually 
decreased in importance as the age of Ss increased from two to ten 
years. The percentages of variance for the various versions of the
same general factor, as reported in Table 3, further corroborate
the validity of an age differentiation hypothesis since the rhos,
representing the correspondence between the hypothesized and ob- 
tained ranks for the percentages of variance across the given age 
levels, are significant at a fairly reasonable level. 
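
The rhos of Table 3 can be checked directly from the percentages of variance. The sketch below assumes that the hypothesized ranking is a strict decline of the general factor with age (rank 1 for the youngest group) and that the four percentage columns belong to versions a through d in the order printed; the helper names are illustrative only.

```python
def ranks_descending(values):
    """Rank values from largest (rank 1) to smallest; assumes no ties."""
    order = sorted(range(len(values)), key=lambda i: -values[i])
    ranks = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        ranks[index] = rank
    return ranks


def spearman_rho(rank_x, rank_y):
    """Spearman rank correlation from two untied rank vectors."""
    n = len(rank_x)
    d_squared = sum((a - b) ** 2 for a, b in zip(rank_x, rank_y))
    return 1 - 6 * d_squared / (n * (n * n - 1))


# Percentages of variance from Table 3, youngest to oldest age group.
percent_variance = {
    "a": [38.49, 35.03, 27.83, 27.86, 27.98, 24.10, 23.43],
    "b": [41.92, 34.82, 26.60, 28.42, 29.27, 26.80, 22.69],
    "c": [42.66, 39.76, 27.82, 29.58, 27.65, 28.45, 27.77],
    "d": [40.35, 32.92, 25.94, 26.91, 29.25, 29.09, 22.06],
}

hypothesized = [1, 2, 3, 4, 5, 6, 7]  # assumed: steady decline with age
rhos = {
    version: round(spearman_rho(hypothesized, ranks_descending(values)), 3)
    for version, values in percent_variance.items()
}
print(rhos)  # {'a': 0.857, 'b': 0.75, 'c': 0.75, 'd': 0.679}
```

Under these assumptions the computed values agree with the rhos printed in Table 3.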


Evaluation and Implications of the Findings 


More important than the adequacy of the evidence presented for 
the stability across samples of factors A, B, C, and D, in the first 
part of the present investigation, is its implication that the authors
of individual or group tests of ability, designed for various age 
levels or samples, should furnish empirical data on the cross-sam- 
ple congruence of factors purportedly measured by their instru- 
ments. The fact that factor C fails to reach, albeit for one sample, 
the level of consistency normal for the other three factors, further 
enhances the weight of the foregoing argument. Notice may also 
be taken of the fact that inspectional comparisons as employed in 
some previous studies (see Zachert and Friedman, 1953, for ex- 
ample) need no longer be accepted as adequate. 

The importance of the findings in the second part of the present 
study is somewhat attenuated by the facts that (a) the number of 
variables in each of the four sets was not the same and (b) the 
variables constituting the general factor, in each subsequent analysis,
were not imbedded into a set of completely new variables as
one would ideally like to do. However, these limitations need not 
obscure the implication that test authors and publishers, whenever 
they claim that a shortened form of a test battery may be em- 
ployed to secure estimates equivalent to those based on the full- 
length battery, should provide necessary evidence demonstrating 
the congruence of the underlying dimensions in addition to the 
commonly presented data on means, standard deviations, etc. for 
the corresponding short and long versions. 


Summary 


The present study was designed to investigate empirically the 
invariance of certain ability factors with respect to two different 
meanings: (a) consistency of factors, for a fixed set of variables,
from one sample of Ss to another and (b) consistency of factors, 
for a fixed sample of Ss, from one set of variables to another. 

The data represented the performance of 700 children, compris- 
ing seven independent samples of 100 Ss each, on the ITPA and 
the 1937 Stanford-Binet. The results clearly indicate that the fac- 
tors as constituted in the present study possess a high degree of 
stability within the context of both of the above-mentioned inter- 
pretations of invariance. An attempt was made to point out the 
implications of these findings for certain procedures of test con- 
struction and analysis. 


REFERENCES 


Barlow, J. A. and Burt, C. The Identification of Factors from 
Different Experiments. British Journal of Statistical Psycho- 
logy, 1954, 7, 52-56. 

Burt, C. The Factorial Study of Temperamental Traits. British
Journal of Psychology, Statistical Section, 1948, 1, 178-203.

Harman, H. H. Modern Factor Analysis. Chicago: University of
Chicago Press, 1960.

Kaiser, H. F. The Varimax Criterion for Analytic Rotation in
Factor Analysis. Psychometrika, 1958, 23, 187-200.

Leyden, T. The Identification and Invariance of Factors. British 
Journal of Statistical Psychology, 1953, 6, 119. 

Meredith, W. Notes on Factorial Invariance. Psychometrika, 1964, 
29, 117-185. 

Quereshi, M. Y. Performance on Individual Ability Tests as a
Function of Various Scoring Cutoffs. EDUCATIONAL AND PSYCHO- 
LOGICAL MEASUREMENT, 1964, 24, 481-512. 

Quereshi, M. Y. Patterns of Psycholinguistic Development during 
Early and Middle Childhood. EDUCATIONAL AND PSYCHOLOGICAL 
MEASUREMENT, 1967, 27, 353-365. 

Tucker, L. R. A Method for Synthesis of Factor Analysis Studies. 
Personnel Research Section Report, No. 984. Washington, 
D. C.: Department of the Army, 1951.

Tucker, L. R. An Inter-battery Method of Factor Analysis. Psychometrika,
1958, 23, 111-136.

Wrigley, C. and Neuhaus, J. O. The Matching of Two Sets of 
Factors. Contract Memorandum Report, A-32. Urbana, Il- 
linois: University of Illinois, 1955. 

Young, Gale and Householder, A. S. Factorial Invariance and Sig- 
nificance. Psychometrika, 1940, 5, 47-56. 

Zachert, V. and Friedman, G. The Stability of the Factorial Pat- 
tern of Aircrew Classification Tests in Four Analyses. Psycho- 
metrika, 1953, 18, 219-224. 


EDUCATIONAL AND PSYCHOLOGICAL MEASUREMENT 
1967, 27, 811-820. 


ON SUBJECTIVITY IN FACTOR ANALYSIS 


JOHN L. HORN 
University of Denver 


Rotation to simple structure in factor analysis was devised by 
Thurstone (1935) as a means for solving the indeterminacy problem.
In the usual factor analysis based upon a matrix of intercorrelations
there is an infinite number of factor solutions, each
involving the same number of factors and each allowing reproduc- 
tion of the correlation matrix with the same degree of accuracy. A 
simple structure solution is one of the infinite number of possible 
solutions, but one which, for the various reasons Thurstone gave, 
might be expected to be more interpretable, more replicatable and, 
in general, more useful in scientific work than any other solution 
(at least for some kinds of substantive problems). Thurstone 
viewed rotation, then, as a means of making factor analysis a 
more useful research tool. 

The various solutions obtainable by use of standard procedures 
for calculating factors represent different criteria of rotation, as 
Thurstone (1947) himself demonstrated. The principal axes solu- 
tion, for example, is a rotation to a set of factors such that the 
first accounts for the maximum possible linear covariation among 
the variables; the second accounts for the maximum possible of 
such covariation after that accounted for by the first factor has 
been removed, etc. In a traditional bi-factor solution the first factor
is rotated in much the same way as in the principal axis solution,
but subsequent factors are rotated in a way to obtain large
loadings for a small number of variables and near-zero loadings 
for the remaining variables. In a Tryon cluster analysis all factors 
are rotated in this latter manner. In any other so-called “direct” 
solution there is, similarly, an implicit rotational criterion. 


An interesting feature of simple structure is that it proposes a
model which seems to have wide usefulness in the behavioral sci- 
ences. The model stipulates that variables represent relatively few 
influences and influences operate with respect to relatively few 
variables. Obviously, such a model is applicable only in studies
where observations are sampled in such a way that simple structure
can inhere in the data. But it seems that such sampling can be
carried out in research on many problems and that, in fact, it is 
often desirable to formulate substantive issues in this way. Yet if 
a simple structure solution is to represent something more than the 
wishful thinking of an investigator, it must be evident in the data; 
not be a condition imposed by the investigator. It is difficult to 
detail the implications of this requirement, but in general it means 
that a valid simple structure solution must be compellingly different
from one that could result by chance. The present study aims
to help clarify this general statement. 


Classification of Rotational Procedures 


It will be useful to cross-classify rotational procedures as pri- 
marily either objective or subjective and primarily either analytic 
or judgmental. This is useful because there are two somewhat dif- 
ferent meanings of the term “subjective” as it pertains to rotation 
in factor analysis and it is desirable to keep these meanings 
separate.

Analytic procedures are those based upon explicit mathematical 
conditions which, once set, fully determine a solution. An example 
is the varimax procedure (Kaiser, 1959). In using this, one makes 
a very simple judgment about the termination criterion, but this is 
quite explicit and, once set, the solution is fully determined. Sim- 
ilarly maxplane (Cattell and Muerle, 1960; Eber, 1966) is mainly 
analytic, although in using this one must set several constants 
(hyperplane width, minimum angle between vectors, etc., as well as
a termination criterion) before the solution becomes fully deter- 
mined. 

In what are here defined as judgmental procedures, on the other 
hand, the conditions which determine solution are not made fully 
explicit. One makes judgments about configurations of points, about
angles between vectors, etc., but these judgments are not specified
in terms of numerical constants or functions to be maximized or
minimized. Visual rotation by means of bi-variate plots is an example
of a judgmental procedure. A bi-factor or cluster analytic
solution may be judgmental if, in forming variable groupings, the
rotator depends upon judgments about which variables belong in
different groupings, rather than using an explicitly stated value
of a B-coefficient (or the like) to specify belongingness.

An objective procedure may be defined as one in which the in- 
vestigator's theory does not enter into the actual determination of 
solution. Here it is necessary to distinguish between the subjective 
choice one makes prior to analysis in selecting a particular kind of
solution with which to test hypotheses—as when an investigator 
elects to seek a simple structure solution, say, in lieu of a principal 
axes solution—and the choices one can make after he has looked 
at results from his analysis—as, for example, when, in rotating by 
visual means, one informs himself about the variable names for 
the points in a plot and rotates in a way to make the solution 
“meaningful.” The former kind of choice is here regarded as not 
detracting from the objectivity of a rotational solution, whereas 
the latter kind of choice indicates subjectivity. 

When rotating by visual means one may studiously avoid know- 
ing the names of the variables which points represent and solution 
may be determined exclusively by consideration of the configura- 
tion among points. Such is what Cattell (1952) has described as 
“blind” rotation and what is here termed objective, even though 
the solution is arrived at by a judgmental process. On the other 
hand, clusters in cluster analysis and groupings in bi-factor anal- 
ysis may be defined, in part at least, in a way to “make sense” 
(i.e. in terms of a favored theory) and it is rotation of this kind 
which is here referred to as not objective—i.e., subjective. Other 
examples of subjectivity in rotation are the square-root procedure 
when employed to put factors through variables said to represent 
concepts of principal concern (cf. Osgood, Suci and Tannenbaum,
1957 in defining semantic differential dimensions) and criterion
rotation (Eysenck, 1958) in which a factor is rotated into alignment
with a particular variable defined in substantive theory as
“the criterion.” 

But subjectivity in this sense is not confined to judgmental pro- 
cedures: in what is called transformation analysis (Ahmavarra, 
1954) and the Procrustes procedure (Hurley and Cattell, 1962), 
the rotator very explicitly specifies several numerical constants
(viz., those indicating the desired solution) and employs equally
explicit minimization functions, but the procedures are subjective 
because they guarantee a maximum possible (by a least-squares 
criterion) agreement between the investigator’s hypothesized solu- 
tion and that obtained. Various other rather widely used factor 
analytic procedures involve subjectivity in this sense (cf. Burt, 
1941; Mosier, 1939; Rogers, 1957). 

The concern in this study is with subjectivity in these last- 
mentioned senses of the term. The basic question asked is: “What 
confidence can we have in a factorial solution which is arrived at 
by means of a subjective rotational procedure?” The tack taken
to provide some insight into this problem is one of determining the 
extent to which meaningless variables can be found to yield up a 
“meaningful” factorial solution when such procedures are used. 


Procedures 


Seventy-four random variables, each based upon an N of 300 
and drawn from a normally-distributed population of real num- 
bers, were separately generated. The variables were given names 
that corresponded with those for various ability and general per- 
sonality variables with which the investigator had recently worked 
(see Horn and Cattell, 1966). The intercorrelations for the 74 
variables were obtained and the resulting matrix was initially fac- 
tored by a principal axis procedure. The number of factors was 
estimated at 21 and factoring proceeded iteratively to determine 
communalities such that all on one calculation differed by less than
.01 from corresponding communalities obtained on the immediately
following calculation. This tended to ensure that factor loadings
would not be large.
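
To give a sense of the correlations such data produce, the following sketch generates independent normal variables for 300 cases and inspects their intercorrelations. It is an illustration under stated assumptions, not the original computation: the seed is arbitrary, and only 20 variables are drawn (rather than 74) so that the example runs quickly.

```python
import math
import random


def pearson_r(x, y):
    """Product-moment correlation between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


random.seed(1967)   # arbitrary seed, for reproducibility only
N_SUBJECTS = 300
N_VARIABLES = 20    # assumption: fewer than the 74 of the study

data = [[random.gauss(0.0, 1.0) for _ in range(N_SUBJECTS)]
        for _ in range(N_VARIABLES)]

# All correlations between distinct variables (the off-diagonal entries).
off_diagonal = [pearson_r(data[i], data[j])
                for i in range(N_VARIABLES)
                for j in range(i + 1, N_VARIABLES)]

largest = max(abs(r) for r in off_diagonal)
chance_sd = 1.0 / math.sqrt(N_SUBJECTS - 1)  # approximate SD of r when rho = 0

print("number of correlations:", len(off_diagonal))
print("largest |r|:", round(largest, 3))
print("approximate chance SD of r:", round(chance_sd, 3))
```

With N = 300 the chance standard deviation of r is roughly .06, so even the largest of many chance correlations stays modest, which is why the resulting communalities and loadings are small.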

An hypothesis factor matrix was defined on the basis of a theory
and results obtained in the studies mentioned above. If a variable
given a particular name was supposed to help define a given factor,
it was given a non-zero loading (.2, .3, .4, .5 or .6) in the hypothesis
matrix; otherwise zero was recorded. The hypothesis matrix
contained 21 factors so as to conform in this respect with the 
matrix determined in analysis. Some of the factors in the hypothe- 
sis matrix were “over-determined” (cf. Thurstone, 1947)—there 
being six or more variables specified in the factor—whereas in 
other factors three or fewer variables were involved, these being
“under-determined” by usual standards.

The Hurley-Cattell (1962) Procrustes procedure was used to rotate
the random-variable factors, analytically, into maximum
possible agreement with the factors of the hypothesis matrix. In
the Procrustes procedure a transformation matrix, T, is defined in
such a way that it will carry the obtained factors, A, into another
matrix, H', and allow the sum of squares of the differences between H'
coefficients and the coefficients of H, an hypothesis matrix, to be as
small as possible for each factor considered separately and maintaining
the communalities of A.
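
The least-squares idea behind such a rotation can be sketched in a few lines. The code below is a simplified illustration, not the Hurley-Cattell program: it assumes that each column of T is obtained by solving the normal equations (A'A)t = A'h for the corresponding hypothesis column h, and that each column of T is then rescaled to unit length. The matrices, the function names, and the normalization step are assumptions made for the example.

```python
import math


def transpose(m):
    return [list(row) for row in zip(*m)]


def matmul(a, b):
    b_cols = transpose(b)
    return [[sum(x * y for x, y in zip(row, col)) for col in b_cols] for row in a]


def solve(m, rhs):
    """Solve m * x = rhs (rhs may have several columns) by Gauss-Jordan elimination."""
    n = len(m)
    aug = [list(m[i]) + list(rhs[i]) for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [v - factor * w for v, w in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def procrustes(a, h):
    """Return (T, A*T): least-squares fit of A to H, column by column."""
    a_t = transpose(a)
    t = solve(matmul(a_t, a), matmul(a_t, h))
    # Assumed normalization: rescale each column of T to unit length.
    columns = []
    for col in transpose(t):
        length = math.sqrt(sum(v * v for v in col))
        columns.append([v / length for v in col])
    t = transpose(columns)
    return t, matmul(a, t)


# Hypothetical unrotated loadings (4 variables, 2 factors).
A = [[0.6, 0.2], [0.5, 0.3], [0.1, 0.7], [0.2, 0.6]]
# A target that is exactly reachable: A rotated through 30 degrees.
c, s = math.cos(math.pi / 6), math.sin(math.pi / 6)
H = matmul(A, [[c, -s], [s, c]])

T, fitted = procrustes(A, H)
worst = max(abs(f - h) for fr, hr in zip(fitted, H) for f, h in zip(fr, hr))
print("largest discrepancy from target:", worst)
```

When the target is not exactly reachable, as with random variables, the same steps return the closest attainable pattern, which is what makes the procedure subjective in the sense defined here.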

The results asked for in the hypothesis matrix and those obtained
by Procrustes analysis are summarized in Table 1. All loadings
larger than .20 (absolute value) have been listed. The names
assigned to variables have been retained to indicate the nature of
the substantive theory here employed, although, obviously, the
naming is completely arbitrary.


[...] diagonal and the squared multiple correlations of each factor with
the others have been entered at the bottom.
In a factor analysis based upon a sample of 300 subjects, as
presented in this study, it would not be at all unusual for an
investigator to base his interpretations of factors on loadings of
about .20 or .25 and larger. Yet adopting a rule like this, it is
evident that the factors here defined on random, arbitrary variables
can be readily interpreted in accordance with hypotheses.
For example, factor 1 can be identified as Inductive Reasoning;
factor 2 would appear to be an intellectual speediness dimension;
factor 4 conforms to an hypothesis stipulating a spatial relations
dimension; factor 7 would seem to be a verbal comprehension
[factor]; and so on for the rest of the factors (see French, Ekstrom
and Price, 1963 for descriptions of these factors as determined in
research on non-random variables). Even in cases where [loadings appear]
that were not expected on the basis of hypothesis, one can
[explain], after the fact, that the loadings “make sense.” For example,
the presence of Forward Writing in factor 2 can be inter-
TABLE 1


Hypotheses and Results from Factoring Random Variables* 


Factor 1 


1. Letter Grouping 

2. Number Series 

5. Furneaux Power 

6. Figure Series 

7. Topology 

9. Matrices Power 
65. Light vs. dark race 
Factor 3 

2. Number Series 

5. Furneaux Power 
12. General Reasoning 
19. Common Analogies 
26. Inferences 
28, Social Situations 

1. Letter Grouping 
42. Forward Printing 
62. Parents’ Education 


Factor 6 

17, Nonsense Memory 
18, Meaningful Memory 
41, Forward Writing 
44. Speed Reading 

58. Adventurousness 
Factor 8 


23. Mechanical Information 


24. Tool Identification 
14, Card Rotation 
52, Self Sentiment 1 


Factor 10 


29. Controlled Associations 


13. Match Problems 
Factor 12 


32. Adding 

33. Careful Subtracting 

34. Careful Dividing 

35. Multiplying 

36, Mixed Arithmetic 

37. Careful Fractions 
1, Letter Grouping 

Factor 14 

40. Designs 


Factor 16 
44, Speed Reading 


На 


Рори 


оъ 


[C iy oo 


oin 


eH 


aH ьљһььһ H 


4 


RFs 


Factor 2 


H 


33 
+26 
.00 


.38 


3. Furneaux Speed 

7. Topology 

8. Matrices Speed 
10. Figure Classify 
16. Paper Form Board 
41. Forward Writing 


Factor 4 

14. Card Rotations 
15. Figures 

66. Height 


Factor 5 


10. Figure Classify 

13. Match Problems 

16. Paper Form Board 
71. No name 

Factor 7 

19. Common Analogies 
20. Abstruse Analogies 
21. Vocabulary 

22. General Information 


Factor 9 
25. False Premises 
26. Inferences 

1. Letter Grouping 
53. Self Sentiment 2 
66. Height 
Factor 11 
30. Things Round 
31. Ideas 
28, Social Situations 


46. Agree with Platitudes 


50. Experience Claimed 
Factor 13 
38. Backward Reading 
39. Street Gestalt 

1. Letter Grouping 


Factor 15 


4, Careful Furneaux Speed 
11. Careful Figure Classify 


27. Careful Estimation 
33. Careful Subtracting 
34, Careful Dividing 
37. Careful Fractions 


ока 


эм оынң aaa H зы 


ook 


H 


ры 


CP 


RE 


.28 
.38 
17 
14 
—.20 
.21 


RF 
.89 
+20 
.20 


RF 


.20 
14 
.28 
—.24 
RF 
.00 
.32 
.94 
45 


RF 
.32 
.50 
.25 
.25 

—.21 

RF 
.25 
.58 

—.20 
—.23 
—.22 


RF 


27 
.92 
—.20 


RF 
.43 
.08 
.22 
.21 
.21 
.20 
48. Highbrow Tastes .5 .81
17. Nonsense Meaning .21 Factor 17 H RF 
Factor 18 H RF 41. Forward Writing :84 5:23 
45. Matching Letters .5 .88 
56. Test Anxiety .2 .10 46. Rapid Cancellation 5 .46 
57. 16 PF Tension .5  .45 6. Figure Series +26 
58. Adventurousness —.3 —.26 
22. General Information —.22 Factor 19 H RF 
81. Ideas .22 38. Backward Reading 2 .14 
45. Matching Letters —.22 42. Forward Printing 3 .21 
43. Novel Writing .5 .28 
46. Rapid Cancellation .5 ..81 
Factor 20 Н RF 54. Early Risks —.24 
50. Experience Claimed .4  .84 Factor 21 H RF 
51. Confidence in Skills .4 .23 
58. Adventurousness 5 .47 52. Self Sentiment 1 Бе ШӨ 
53. Self Sentiment 2 5 .38 


26. Inferences .28 
*H stands for “hypothesis” and RF stands for “random variable factors.”


preted as indicating that this variable involves both speediness 
and some intellectual processes. Likewise Height in factor 4 can be 
explained as due to the fact males tend to be better in spatial 
tasks and taller than females. True, in some cases an “unexpected” 
loading presents a particularly troublesome problem in interpreta- 
tion (how would you explain the negative loading of Letter Group- 
ing in the Number factor 12, for example?). But this is not unlike 
the situation often encountered in factor analysis. It’s a good bet 
that in most such cases an experienced factor interpreter can come 
up with a plausible explanation. 

An outcome which suggests that the results of Table 1 are not 
valid is the fact that loadings are not very large. Most of them 
fall below .30, although several are above .4. But if it were not
known that these results were obtained by the subjective proce- 
dures here employed, these loadings, although low, would not be 
entirely unrespectable. Certainly loadings as low as these have 
been interpreted in published studies. 

The correlations between variables are those expected by chance 
in samples of size 300. The communalities and factor loadings are 
determined by these values. A sample of 300 is fairly large relative 
to the samples used in substantive experiments, and the correlations
expected by chance in this sample are correspondingly small;
hence, the factor loadings in these analyses can be small, as ob- 
tained. But if the N were reduced, factor loadings would average 
larger and thus appear to be more "respectable." Thus if N is 
small and subjective rotational procedures are used, an investi- 
gator should probably be wary about concluding that factor load- 
ings indicate meaningfulness, even if the loadings are “respectably 
large." 

The simple structure of the factors of Table 1 is “good” if by
this is meant that hyperplane counts are large (and variable vectors
are not normalized). The proportion of loadings within a
±.10 hyperplane is .60 or better for all factors, and for many
factors it is larger than this. Thus it appears that hyperplane
count, by itself, without being expressed relative to size of communalities,
is not a sufficient indication of meaningfulness of a
factor solution.
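
The hyperplane count referred to here is simply the share of a factor's loadings that fall within a narrow band around zero. A minimal sketch follows; the function name and the ten loadings are hypothetical, and the band of ±.10 is the width named in the text.

```python
def hyperplane_proportion(loadings, width=0.10):
    """Proportion of loadings whose absolute value is within the hyperplane width."""
    inside = sum(1 for value in loadings if abs(value) <= width)
    return inside / len(loadings)


# Hypothetical loadings of ten variables on one factor.
factor = [0.38, 0.28, -0.20, 0.05, -0.03, 0.08, 0.00, -0.09, 0.02, 0.07]

print(hyperplane_proportion(factor))  # 0.7
```

Because loadings built on chance correlations are small throughout, most of them fall inside the band almost automatically, so a high count of this kind says little unless it is judged against the size of the communalities.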

Professor Lloyd Humphreys (personal communication) has ex- 
pressed the view that factors obtained by subjective rotational 
procedures will tend to be highly intercorrelated. Support for this 
hypothesis is indicated in Table 2. The correlations between factors
are, in many cases, quite large by usual standards and the
multiple correlations of each factor with the others indicate lack 
of independence in measurement of the attributes represented by 
factors. This kind of information may thus be used to indicate 
invalidity in a factor solution obtained by subjective means. 


Summary 


Subjectivity in factor analysis was defined in terms of lack of 
operational independence between an investigator’s substantive hy- 
potheses concerning factorial structure and the procedures used to 
determine factor structure. Subjectivity in this sense was seen to be 
distinct from the subjectivity involved in “blind” visual rotation. 
The study aimed to demonstrate that by use of subjective (in the 
first sense) rotational procedures, random, completely nonsensical 
variables could be made to define what seemed to be meaningful 
factors.

Seventy-four random variables for an N of 300 were generated, 
arbitrarily named, intercorrelated, factored and rotated into a best 
(least-squares) approximation to an hypothesis factor matrix. The 
resulting factors were found to be quite interpretable and to have
high hyperplane counts, although communalities and factor load- 
ings were low. There was thus indication that if an investigator 
were willing to interpret relatively low loadings if these seemed to 
“make sense,” he needn’t bother to gather actual data: random 
variables may be labeled arbitrarily and pushed into solutions that 
make quite “good sense.” The intercorrelations among factors de- 
termined in this way tended to be large, however, thus suggesting 
that this kind of information can be used to indicate the invalidity 
of a factorial solution arrived at by subjective rotation. 
THE USE OF MULTIPLE REGRESSION PROCEDURES
WHEN THE PREDICTOR VARIABLES ARE 
PSYCHOLOGICAL TESTS 


LEROY WOLINS 
Iowa State University 


Modern textbooks on statistical methods uniformly caution the
reader that regression procedures are based on the assumption that 
the predictor variables are not subject to error. However, when the 
predictor variables are psychological tests, this assumption is seldom,
if ever, approximated. On the other hand, some of the implications
of statistical theory clearly point to procedures for hypothesis
testing and estimation when the predictor variables are subject
to error that are superior to usual procedures dictated by the inap- 
propriate statistical model. This statistical theory more or less 
(rather than precisely) defines the kind of inferences one can and 
cannot make from regression analysis based on such predictor vari- 
ables and suggests an alternative cross-validation design. 

The purpose of this paper is to expound rather than derive these 
principles. This exposition is intended to alert the quantitatively 
sophisticated reader to these principles and to provide the less 
quantitatively sophisticated reader with statistical tools more ap- 
propriate for psychological prediction than those offered in text- 
books.

These principles require a psychologist to expound them appar- 
ently because the statistician, by temperament, requires more pre- 
cise statements than those offered here. However, it is anticipated 
that these principles will be endorsed (more or less) by most 
statisticians. 

Principle 1. Inferences drawn from estimates of and tests of hy- 
potheses concerning partial regression coefficients have greater 
probability of being wrong than indicated by the usual procedures.

The solution to a regression problem, the partial regression coefficients,
requires solving m + 1 equations for m + 1 unknowns
where m is the number of predictors. However, when psychological
tests are the predictors it is probable that the solution obtained is
not nearly as determined as the usual theory indicates. That is, if
one is solving for these coefficients using correlations, the recommended
procedure is to use unities in the diagonal of the correlation
matrix. If the X variables are subject to error, unities in the
diagonal are misleading since a variable subject to error cannot
correlate unity with anything. Thus, the rows (or columns) of the
correlation matrix are more nearly dependent on one another, or
more alike, than it would appear from using unities as the diagonal
entries. In some sense using reliabilities in the diagonal or correcting
the correlations for attenuation will result in a meaningful
solution, but this procedure is probably worse than the procedure
recommended in textbooks for answering the questions about the
best set of weights to use in a new sample from the same population.
However, the procedure of correcting correlations for attenuation
and performing the regression on these corrected coefficients is
probably better than the usual procedure for answering the theoretical
question: How much is predictor X1 related to Y independently
of X2? However, coefficients based on corrected correlations are not
amenable to significance tests either, so to judge the correctness
of his results the investigator must depend on large samples and
inspection of the estimates.
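
The correction for attenuation mentioned above can be illustrated with the classical formula, in which an observed correlation is divided by the square root of the product of the two reliabilities. The sketch below assumes that formula; the function name and the numerical values are hypothetical.

```python
import math


def correct_for_attenuation(r_xy, reliability_x, reliability_y=1.0):
    """Estimated correlation between true scores, given the observed correlation
    and the reliabilities of the two measures (1.0 means error-free)."""
    if not (0.0 < reliability_x <= 1.0 and 0.0 < reliability_y <= 1.0):
        raise ValueError("reliabilities must lie in (0, 1]")
    return r_xy / math.sqrt(reliability_x * reliability_y)


# Hypothetical values: two tests correlating .60, with reliabilities .80 and .70.
observed = 0.60
corrected = correct_for_attenuation(observed, 0.80, 0.70)
print(round(corrected, 3))  # 0.802
```

The corrected intercorrelations among predictors are larger than the observed ones, which is one way of seeing why the rows of the correlation matrix are more nearly dependent than the observed values with unities in the diagonal suggest.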

As the m equations become more nearly dependent, the solution
becomes less determined (Perloff, 1951). The more dependent the
predictor variables are on each other, the more solutions there are
which will result in essentially the same predicted values. The extreme
situation occurs when all predictors correlate unity with each
other. In this situation all linear combinations of X result in essentially
the same predicted Y's (except for a trivial subset of
linear combinations).


For example, 
Solving this system of equations, we find


Accepting these results at face value, one might conclude variable
2 is “good” but variable 1 is “bad.” However, in this small example
it may be obvious that the difference between ry1 and ry2
might be explained by the fact the error in X1 correlates less highly
with Y than the error in X2 does. As a result, it is not improbable
that in the next sample the results could be reversed and this reversal
may occur as a result of measurement error rather than the
sampling error accounted for by the usual procedure.
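
How sharply the weights can swing is easy to see in the two-predictor case, where the standardized partial regression weights follow directly from the three correlations. The sketch below uses hypothetical correlations chosen only for illustration (two predictors intercorrelating .90, with validities of .50 and .60); the function name is likewise an assumption.

```python
def two_predictor_weights(r_y1, r_y2, r_12):
    """Standardized partial regression weights and squared multiple correlation
    for a criterion Y regressed on two predictors."""
    denominator = 1.0 - r_12 ** 2
    beta_1 = (r_y1 - r_y2 * r_12) / denominator
    beta_2 = (r_y2 - r_y1 * r_12) / denominator
    r_squared = beta_1 * r_y1 + beta_2 * r_y2
    return beta_1, beta_2, r_squared


# Hypothetical sample: validities differ by only .10, predictors correlate .90.
first = two_predictor_weights(0.50, 0.60, 0.90)
# The same validities with the two predictors interchanged.
second = two_predictor_weights(0.60, 0.50, 0.90)

print([round(v, 3) for v in first])   # [-0.211, 0.789, 0.368]
print([round(v, 3) for v in second])  # [0.789, -0.211, 0.368]
```

A difference of .10 between the two validities is enough to give one predictor a large positive weight and the other a negative one, while the squared multiple correlation is the same whichever predictor happens to come out ahead.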

In problems involving more than two predictor variables, the 
correlations among the predictors may be considerably lower than
.90 yet spurious results may still occur due to the fact that any 
one predictor is highly dependent on a linear combination of the 
remaining ones. 

A corollary of this principle is that the usual procedure results in
meaningful estimates and tests of significance for regression weights
when the predictor variables are correlated near zero. This corollary
is seldom helpful when the predictor variables are psychological
tests (Levine, 1958).

Principle 2. The estimate of the multiple correlation squared by
the usual procedure is meaningful and the significance test applied
to this index results in inferences that are misleading approximately
as often as advertised in statistical tables.

As the intercorrelations among the predictor variables increase
in size, the arrangement of the predicted values, Ŷ's, becomes increasingly
less dependent on the values of the regression coefficients
(Perloff, 1951). As these intercorrelations decrease in size the Ŷ
values become increasingly more dependent on the values of the
regression coefficients but the estimates of these coefficients become
increasingly better. As a result, the multiple correlation squared,
R², may be tested for significance from zero.

The interpretation of $R^2_{y\hat{y}}$ is usually correct as advertised
in textbooks. However, if the psychological tests used in the
regression equation are unreliable, then the textbook interpretation of
$R^2_{y\hat{y}}$ is somewhat misleading. If the composite, $\hat{Y}$,
is no more reliable than Y, the criterion, then $R^2_{y\hat{y}}$
probably better describes the proportion of variance in Y explained by
$\hat{Y}$. However, most often psychological tests are more reliable
than criteria are, and a linear composite of reliable tests will
probably be even more reliable than any one test. Thus, the proportion
of variance in Y explained by $\hat{Y}$ is usually closer to
$R_{y\hat{y}}$ than to $R^2_{y\hat{y}}$ but always lies somewhere
between them.

Principle 3. Indices based on relatively few variables selected from
many are biased.

When the investigator starts out with n predictor variables and
through the test selection procedure (Wherry, 1950) selects a
number, m, of variables where m < n, and these m variables result in
a higher $R^2$ than any other set of m variables, the $R^2$ is too
large by some unknown amount and the regression coefficients are
biased.

It is common practice to cross-validate when indices are derived 
in this fashion by applying the weights derived from the sample 
in which the tests were selected to a new sample of observational 
units. Although this common practice has much to recommend it
(Cureton, 1950), a better procedure is available.

Since the solution derived from the sample used to select the tests 
is biased, it is less likely that this solution is as near to optimum as 
the solution derived from the cross-validation sample. Despite the 
fact that weights derived from predictor variables subject to error 
are not as precise as indicated from the inappropriate model, such 
weights are the best estimates we can get of what will occur in a 
new sample. The shrunken multiple correlation based on a cross- 
validation sample using all the predictors specified in the validation 
sample is unbiased conditionally on the first sample (Wherry, 1931). 
Also, Principle 2 applies to $R^2$ based on the cross-validation sample.

In other words, the only purpose the validation sample should
serve is to specify what predictors to use. Beyond this, information
from the cross-validation sample should be used to estimate
weights and the multiple correlation squared.

Principle 4. Many cross-validation studies are designed less than
optimally.

An alternative design for cross-validation, available to psychologists
when each variable is repeatedly measured, is to randomly partition
the measures of each variable into two groups. The best set of
predictors may be determined from scores derived from the first groups
of measures; the regression coefficients and the $R^2$ values may then
be unbiasedly estimated (in some sense) from scores derived from those
second groups of measures which represent the variables selected for
use. Most often the measures will be responses to test items in the
case of the predictors. The measures of the criterion may be ratings
from different raters, or individual course grades if grade-point
average were the criterion.

If the variables are measured highly reliably and the number of
observational units, N, is small, this design will be better than the
usual one. However, if the reliability of the measures is generally
low and N is large, then the conventional cross-validation procedure
would seem to be superior. Since many problems lie between
these extremes, an algebraic statement of the relative efficiency of
these two designs may prove useful.

The formula for testing the null hypothesis that the multiple
correlation is zero using the cross-validation sample based on half
of the observational units is

$$F_{m,\;N/2-m-1} = \frac{R^2_{y\hat{y}}/m}{(1 - R^2_{y\hat{y}})/(N/2 - m - 1)} \tag{1}$$

Similarly, when the whole sample of observational units is used
but each variable is measured using only half the items, the
formula is

$$F_{m,\;N-m-1} = \frac{R^2_{y/2\,\hat{y}/2}/m}{(1 - R^2_{y/2\,\hat{y}/2})/(N - m - 1)} \tag{2}$$

Using the Spearman-Brown formula it can be shown, in terms of
expectations,

$$R^2_{y/2\,\hat{y}/2} = \tfrac{1}{4}\,(1 + r_{y/2\,y/2})(1 + r_{\hat{y}/2\,\hat{y}/2})\,R^2_{y\hat{y}} \tag{3}$$

where $r_{y/2\,y/2}$ is the correlation between the two measures of Y and
$r_{\hat{y}/2\,\hat{y}/2}$ is the correlation between the two weighted
composites of the X's using the weights derived from the
cross-validation samples. Substituting (3) into (2),
If the investigator has prior knowledge such that he can guess
the various terms in (4), he may compare the efficiency of the 
design offered here relative to the conventional design. This rela- 
tive efficiency is obtained by comparing the ratio of (4) to (1) 
with the ratio of the tabular values of the two F ratios required 
for a designated significance level. 
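
The comparison of the two designs can be sketched numerically. The code below is a rough illustration, not the author's formula (4): it assumes the usual F ratio for a multiple correlation, and it assumes that, in expectation, the squared correlation between half-length measures equals one quarter of (1 + half-criterion correlation) times (1 + half-composite correlation) times the full-length squared multiple correlation, which follows from the Spearman-Brown formula together with the correction for attenuation. All function and argument names are hypothetical.

```python
def f_half_sample(r2_full, n, m):
    """F ratio when full-length measures are used on half of the N units."""
    df2 = n / 2 - m - 1
    return (r2_full / m) / ((1 - r2_full) / df2)


def r2_half_length(r2_full, r_crit_halves, r_comp_halves):
    """Expected R^2 between half-length criterion and half-length composite.

    r_crit_halves: correlation between the two half measures of Y.
    r_comp_halves: correlation between the two half-length composites.
    Assumed relation (Spearman-Brown plus attenuation), see the lead-in.
    """
    return 0.25 * (1 + r_crit_halves) * (1 + r_comp_halves) * r2_full


def f_half_items(r2_full, r_crit_halves, r_comp_halves, n, m):
    """F ratio when all N units are used but each measure is half length."""
    r2_half = r2_half_length(r2_full, r_crit_halves, r_comp_halves)
    df2 = n - m - 1
    return (r2_half / m) / ((1 - r2_half) / df2)


# Guessed values for one hypothetical planning problem.
f_conventional = f_half_sample(r2_full=0.36, n=60, m=3)
f_split_items = f_half_items(0.36, 0.8, 0.8, n=60, m=3)
```

Because the two F ratios have different denominator degrees of freedom, each should still be compared with its own tabled critical value rather than directly with the other.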

The choice of design exploits, to some degree, a deficiency in
data. Since the predictor variables are subject to measurement
error, one can cross-validate by sacrificing precision in sampling or
measurement. Since one or the other must be sacrificed in order to
cross-validate, one may be able to choose in an optimum way.

Finally, a word of warning. The procedure of splitting tests leads
to spurious results when reliability estimates are likely to be
spuriously high, e.g.:

1. The tests are speeded. In order for this procedure to be
   appropriate for speeded tests, scores from separately timed
   sections of each test must be obtained.
2. The items were grouped on the basis of factor analysis or
   item analysis using the sample of observational units on
   which the regression analysis is to be carried out. If items are
   placed into groups (or tests) on the basis of high
   intercorrelations, then one expects a group of items defined from
   one sample to intercorrelate less highly when a second sample is
   used. Thus, the items must come to the investigator already
   grouped, or the grouping of items must be done a priori, for
   the present procedure to be appropriate.


Discussion 

Typically, when psychologists use multiple regression, their
results are subject to error due to both sampling and measurement. 
Statisticians have not developed procedures through which this sit- 
uation can be dealt with precisely. However, perusal of such theory 
that does exist strongly suggests certain procedures commonly used 
are not justified (see Principle 1), others are not optimum (see 
Principle 3), and still others have not been fully exploited (see 
Principle 4).

VALIDITY AS A FUNCTION OF EMPIRICAL SCALING
OF TEST ITEMS BY A LOGISTIC MODEL¹


MANFORD J. FERRIS²


Center for Research and Development in Higher Education
University of California, Berkeley


Logistic Model

Both the validity and the reliability of a mental test depend
upon the characteristics of the items comprising the test. Estimates
of two characteristics of a test item—difficulty and discriminating 
power—usually have been computed by the normal-ogive function. 
The normal-ogive is fitted to the observed proportions of correct 
item response at successive levels of the criterion measure. When 
this function is used, two test item parameters are found; one is 
the criterion value at which the probability of a correct response 
is one-half; the other is the reciprocal of the discriminating power 
or precision of the item. 

Use of the logistic function instead of the normal function for
calculating maximum likelihood estimates of test item parameters
has been proposed by Maxwell (1959), because simple sufficient
statistics are available for straightforward estimation of the
parameters of the logistic curve. The logistic function is given by
$P = [1 + e^{-(\alpha + \beta z)}]^{-1}$, and the logit of P is defined
as $\operatorname{logit} P = \ln[P/(1 - P)] = \alpha + \beta z$. Both
the normal and logistic models delineate growth curves of increasing
functions; each has a characteristic S-shape with the period of
greatest growth in the middle and slowest growth at the ends. Baker
(1965) advocates currently

¹ A revised version of a paper read at the Annual Meeting of the American
Educational Research Association, Chicago, Illinois, February, 1966. Based on
a dissertation submitted in partial fulfillment of the requirements for the
Ph.D. degree at University of California, Berkeley, in 1965.
² Now at the American Institutes for Research, Palo Alto, California.

using curve fitting techniques in item analysis, particularly since
modern digital computers make the use of such techniques possible.
Baker (1961) has compared empirical results obtained by the two
models; he found that the logistic function could be used as an
alternative to the normal function, provided the numerical value of
the discrimination index from the logistic model is properly
interpreted. Values for the item difficulty parameter agreed closely
because of the symmetry of the two curves about their midpoints.
Baker also found that the computer cost of analysis using the
logistic function was approximately one-third the cost of using the
normal function.

The logistic model has been used in bioassay studies; according
to toxicological terminology, P is the observed proportion affected
by dosage z. If logit P is plotted against z, the resultant function
is linear with intercept $\alpha$ and slope $\beta$. When P = .50,
logit P = 0.0 and $z = -\alpha/\beta$; this gives the dosage level for
which the probability of being affected is one-half. When the logistic
model is applied in psychometric investigations, P is the probability
of response at ability level z. Again, when P = .50,
$z = -\alpha/\beta$, which is the limen of the item, or that value of z
for which the likelihood is 50-50 that a student will respond
correctly to the item.
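
The relations among the logistic curve, the logit, and the limen can be written out as a minimal sketch. The parameter values in the example (intercept -5.8, slope 0.2) are hypothetical and chosen only so that the limen falls at a raw score of 29; they are not estimates from the study.

```python
import math


def logistic(z, alpha, beta):
    """Probability of a correct response at level z under the logistic model."""
    return 1.0 / (1.0 + math.exp(-(alpha + beta * z)))


def logit(p):
    """Log odds of a proportion p, with 0 < p < 1."""
    return math.log(p / (1.0 - p))


def limen(alpha, beta):
    """Level z at which the modeled probability is one-half."""
    return -alpha / beta


# Hypothetical item parameters (not from the study).
alpha, beta = -5.8, 0.2
z_half = limen(alpha, beta)
p_at_limen = logistic(z_half, alpha, beta)
```

Plotting `logit(logistic(z, alpha, beta))` against z gives a straight line with intercept `alpha` and slope `beta`, which is the linearity the text describes.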

The logistic function associated with a test item is a response
curve for which simple sufficient statistics exist for estimating the
item parameters $\alpha$ and $\beta$. Berkson (1957) has shown that the
maximum likelihood estimates of the item parameters are functions
of their own sufficient statistics. Convergence to finite values
through an iterative solution yields in the limit the maximum
likelihood estimates for $\alpha$ and $\beta$.


Item Difficulty 


Item difficulty is generally defined as the proportion of examinees
passing the item, i.e., the percentage of a given group making the
right response to the item. While an index of item difficulty is this
proportion of correct response, for scaling purposes this p value is
usually transformed into a corresponding z value read from a
table of normal-curve deviates for a given area under the normal
curve.

Scaling of test items using the normal curve results in scale
values that express item difficulty in sigma units which indicate
distance from the mean taken at zero. For a proportion of .50, the
z deviate is zero; proportions less than .50 have positive scale
values, and proportions greater than .50 have negative scale values.
In other words, the higher the proportion passing an item the lower
the item is located on the difficulty scale. This transformation from
p to z provides a scale value that is a direct measure of item
difficulty, whereas before, p was inversely related to difficulty.
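
The p-to-z transformation can be sketched with the inverse normal distribution function in Python's standard library in place of a printed table. The function name is hypothetical; the sign is reversed so that, as described above, items passed by fewer than half the examinees receive positive scale values.

```python
from statistics import NormalDist


def difficulty_z(p):
    """Normal deviate scale value for a proportion passing, 0 < p < 1.

    The deviate is the point on the unit normal curve above which a
    proportion p of the area lies, so harder items (small p) score higher.
    """
    return -NormalDist().inv_cdf(p)


z_easy = difficulty_z(0.80)   # negative: an easy item
z_hard = difficulty_z(0.20)   # positive: a hard item
z_mid = difficulty_z(0.50)    # zero: half the group passes
```

Adding a constant such as 3 to each value, as is done for the item weights reported later in Table 1, removes the negative signs without changing the spacing of the items.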

Since item analysis is often performed against total score on the
test itself, the index of difficulty or limen of a test item is defined
in the study as the raw (total test) score for which the probability
of an examinee's having passed the item is .50, where the raw score
of an examinee is the number of test items he responded to
correctly. Thus, when the limen of a test item is found to be 29, this
means that if a student received a raw score of 29, there is a 50-50
chance that he answered the item correctly. For students whose
raw scores were less than 29, the probability of their giving the
right response to the item is less than .50; students with raw scores
greater than 29 have a higher likelihood than .50 of having
responded correctly to the item.

For each possible raw score z on a test, there is an observed
proportion of subjects receiving that raw score who correctly
answered a particular item. These observed proportions, after being
corrected for chance success, are fitted by the logistic function.
The resultant theoretical corrected proportions are then
transformed to theoretical observed proportions. These four
proportions for raw score z are related in the following ways:


(1) observed proportion passing = $P_z = P_z^* + (1 - P_z^*)/k$,
    where k = number of choices given for the item
(2) corrected proportion passing = $P_z^* = (kP_z - 1)/(k - 1)$
(3) theoretical corrected proportion determined by computer
    program using logistic function = $\hat{P}_z^*$
(4) theoretical observed proportion = $\hat{P}_z = [(k - 1)/k]\hat{P}_z^* + 1/k$.

The relationship between observed and corrected proportions
passing an item as expressed in equations (1) and (2) requires an
explanation. If a testee guesses when he does not know the right
response to a multiple-choice item, it is frequently assumed that he
has a 1/k probability of guessing correctly, where k is the number
of choices given for the item. To allow for such chance success, a
plausible assumption is that the probability of a correct response
increases from 1/k to 1 as ability increases from minus infinity to
plus infinity (Lord, 1953 and Birnbaum, 1957). Thus, the observed
proportion of correct responses for raw score z in terms of corrected
proportion passing is expressed as in equation (1).

The logistic function is used to fit the graph of raw score
versus corrected proportion passing for an item; this results in
$\hat{P}_z^*$ in expression (3). The resultant intercept and slope
values, a and b, which are estimates of $\alpha$ and $\beta$, are used
to calculate an estimate of the value of z for probability .50 on a
theoretical fit of $\hat{P}_z$. According to the logistic model used
for multiple-choice items in this study, the limen of a test item is
$z_{.50} = (\ln[k/(k - 2)] + a)/(-b)$.
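
The correction for chance success and the resulting limen can be checked with a short sketch. The limen expression follows from setting the theoretical observed proportion to .50: the corrected proportion must then equal (k - 2)/[2(k - 1)], whose logit is ln[(k - 2)/k]. The intercept and slope in the example are hypothetical values, not estimates from the study, and the function names are illustrative.

```python
import math


def corrected_proportion(p_obs, k):
    """Remove chance success from an observed proportion (k choices)."""
    return (k * p_obs - 1) / (k - 1)


def observed_proportion(p_corr, k):
    """Restore chance success to a corrected proportion (k choices)."""
    return ((k - 1) / k) * p_corr + 1 / k


def limen_with_guessing(a, b, k):
    """Raw score at which the theoretical observed proportion is .50.

    a, b are the fitted intercept and slope of the logit of the
    corrected proportion; requires k > 2.
    """
    return (math.log(k / (k - 2)) + a) / (-b)


# Hypothetical fit for a five-choice item.
a, b, k = -5.0, 0.2, 5
z_50 = limen_with_guessing(a, b, k)
p_corr_at_limen = 1.0 / (1.0 + math.exp(-(a + b * z_50)))
p_obs_at_limen = observed_proportion(p_corr_at_limen, k)
```

Note that the limen lies below the point -a/b, since with guessing allowed an examinee reaches a 50-50 chance of success before the corrected proportion itself reaches one-half.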

Description of the Study 

The purpose of the study was to investigate the possibility of 
improving predictive validity of a standardized test by weighting 
the test items according to indices of item difficulty determined by 
the logistic model. The research hypothesis was that correlations 
between predictor and criterion scores based on empirically 
weighted items are higher than correlations between test scores 
each expressed merely as the number of items correctly solved. 

The quantitative section of the Cooperative School and College 
Ability Tests (SCAT) Form 2A was used as the predictor measure; 
the mathematics section of the Cooperative Sequential Tests of 
Educational Progress (STEP) Form 2A was the criterion test. 
Both tests contain 50 multiple-choice items; SCAT-Q has five al- 
ternative responses per item and STEP-M has four-choice items. 

The study sample was composed of 1,893 eleventh grade students
who had responded to all 50 items on each of the two standardized
tests.


Results 

The correlation between the SCAT-Q and STEP-M raw test
scores was .792 when all 50 items were included in the total test
scores and were conventionally weighted 1 or 0. Five of the
SCAT-Q test items and two STEP-M test items were eliminated
from subsequent analyses because the logistic model showed that
the indices of difficulty for these seven items were less than the
minimum expected raw score acquirable through guessing alone,
which is found by dividing the total number of test items by the
number of choices per item. One STEP-M item was deleted
because the difficulty index for it was higher than the total possible
score on the test. When these eight items were eliminated, the
correlation between the SCAT-Q and the STEP-M test scores was
.795. Therefore, the logistic model designated eight items on the
two tests that had no real effect on the intertest correlation. This
finding shows that the logistic model might be quite useful in the
item analysis phase of test construction.

A canonical correlation analysis, which found the theoretically 
largest possible correlation between linear combinations of the pre- 
dictor and criterion measures, gave a maximum correlation of .850. 
Obviously, with a raw score correlation of .795 and a computed 
maximum possible correlation of .850, there is not much leeway for 
improving the intertest correlation by weighting the test items. 

For each subject in the study sample, three scores were available
for each test: (a) his raw score obtained by simply summing the
number of problems correctly solved on each test; (b) his test
score determined by summing the item weights based on proportions
passing the items; and (c) his test score calculated by summing
the item weights based on the logistic model. Pearson r correlations
between the respective six total test scores determined for each
subject, means, standard deviations, and summations of test item
weights (expressed as normal deviates to which +3 was added to
eliminate negative signs) are given in Table 1.


TABLE 1


Correlations, Means, Standard Deviations, and Summations of Item Weights
for 45 SCAT-Q and 47 STEP-M Test Items


                  SCAT-Q                           STEP-M
          Raw   Proportion  Logistic       Raw   Proportion  Logistic
           1        2          3            4        5          6
1         --      .996       .990         .795     .788       .768
2                  --        .998         .808     .798       .784
3                             --          .813  [illegible]   .792
4                                          --      .997       .991
5                                                   --        .998
6                                                              --
Mean     26.09    78.53      72.60        22.94    68.87      73.90
SD        9.04    29.62      29.23         8.68    26.72      29.23
Σw       45.00   145.88     142.46        47.00   148.30     165.98

It had been hypothesized that the correlations would increase as
one follows the dashed line down and to the right. The correlation
between logistically weighted SCAT-Q scores and raw STEP-M
scores was .813. The correlation between logistically weighted
SCAT-Q scores and logistically weighted STEP-M scores was
.792; this correlation is no better than the correlation between the
conventionally weighted scores on the two tests.

The results of the study then do not confirm the complete re- 
search hypothesis; however, at the same time, the results do not 
clearly indicate that further investigation would be unwarranted 
or fruitless. In general, the findings of the study tend to indicate 
that rescoring tests by using item weights computed by the logistic 
model does not improve predictive validity of a test. Still, since 
the correlation between the predictor scores based on logistic item 
weights with the raw criterion scores was slightly higher than the 
correlation between raw predictor scores and the raw criterion 
scores, there may be merit in using the logistic model to determine 
indices of difficulty of test items for improving predictive validity. 


Discussion 

Despite the fact that the sample size was large, it may not have 
been the best sample to use to determine the estimates of the 
indices of difficulty for the test items, particularly for the STEP-M. 
In fact, this possibility may be substantiated by looking at the 
summations of test item weights in Table 1. The summations for 
the SCAT-Q test are quite close; the difference being 3.42. How- 
ever, there is considerable difference between the two STEP-M 
item weight summations, specifically, 17.68. This indicates that the 
item weights derived by the logistic model were evidently too high 
for the STEP-M test, possibly because the STEP-M test may have 
been exceptionally difficult for the sample involved in the study. 

In view of the fact that the correlation between the raw scores
for the SCAT-Q and STEP-M tests was high, i.e., .795, when
compared with the maximum possible correlation of .850 as
computed by canonical correlation analysis, the use of these two tests,
which the Educational Testing Service has thoroughly refined, may
not have given the logistic model a fair chance to demonstrate any
possibility for improving predictive validity. This logistic model
should be applied to other tests in order to investigate further its
potential usefulness in item analysis as well as in item weighting.

GRAPHICAL PROCEDURES FOR MULTIVARIATE
CLASSIFICATION¹


GEORGE H. DUNTEMAN²


Regional Rehabilitation Research Institute
College of Health Related Professions
University of Florida


Both discriminant analysis and cluster analysis are useful
techniques for examining multivariate data. However, if the group
centroids are equal, then discrimination among groups by linear
methods is impossible. If the dispersion matrices are unequal while
the centroids are identical, then non-linear discriminant equations
may be utilized for classification purposes. But non-linear
equations are difficult to interpret, and if the dispersion matrices
are not multivariate normal, then the non-linear likelihood functions
based upon the multivariate normal distribution are also inappropriate.
If a linear or non-linear discriminant analysis is being conducted, 
for example, on two groups, then it could very well be that there 
are more than two modes in the multivariate distribution. It is 
conceivable that these two explicitly defined groups might well form 
three or more “natural” clusters in the test space rather than just 
one cluster if there were no differences among the groups or two 
clusters if the groups were different. The various methods of cluster 
analysis offer little help in isolating the number of “natural” clus- 
ters since they are based primarily on minimum variance partitions 
which may result in a large number of clusters even when the 
multivariate distribution is essentially unimodal. With all of these 
methods it is usually up to the researcher to decide upon the num- 
ber of clusters. 

¹ This research was supported by research grant RD-1127 from the
Vocational Rehabilitation Administration, Department of Health, Education,
and Welfare, Washington, D. C.
² Now at the Educational Testing Service, Princeton, New Jersey.

The purpose of this paper is to outline a graphical procedure for
classifying multivariate observations into two or more groups. For 
the purposes of exposition, it will be assumed that there are two 
groups of equal size. However, the procedure could be generalized 
to three or more groups of unequal size. The advantage of a graphi- 
cal procedure is that the magnitude of differences among and within 
groups can be easily visualized. 


Reducing Dimensionality 


The procedure involves (1) an attempt to represent the
configuration of the people in the p (test) space (p > 2) by the first
one or two principal components and then (2) graphically constructing
decision boundaries to separate the a priori groups. The first part
of this procedure has been discussed by Rao (1965). It involves a
principal components analysis of the correlation matrix with ones
in the principal diagonal, obtaining the principal component scores,
and plotting the scores for the subjects on the first two principal
components. If the first two principal components account for a
substantial part of the total test variation, then the configuration
of the individuals' principal component scores in the principal
components space will be a close representation of the actual
configuration of their points in the original test space. If, for
example, the first two components account for 80 or 90 per cent of the
test variation, then for all practical purposes the configuration in
the principal components space is extremely close to the configuration
in the original test space. However, if the first two components
account for only 30 or 40 per cent of the test variation, then,
although this is the best representation that can be obtained from
two variables, it is not an adequate representation of the
configuration of the groups in the test space.

Rao (1965) recommends obtaining the difference between D²
(D² equals the sum of the squared differences in test scores between
any two individuals) in the original test space and D² computed on
the basis of the first two components for each pair of individuals. If
these differences are uniformly small, then we can determine the
clusters on the basis of the first two principal components. If this
is not the case, then we could disregard those points where the
difference between the two D²'s is large and examine those where the
differences are small, if we were going to cluster the points or draw
decision boundaries for separating the groups.
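
The D² comparison can be sketched as follows. This is an illustration under a stated assumption, not Rao's own computation: the component scores are taken to be an orthogonal rotation of the standardized test scores (each component's scores having variance equal to its latent root), so that D² computed over all p components equals D² in the original test space. The function names and the example scores are hypothetical.

```python
def d_squared(u, v):
    """Sum of squared score differences between two individuals."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def d_squared_loss(scores_i, scores_j, kept=2):
    """Distance lost by keeping only the first `kept` components.

    scores_i, scores_j: component scores of two individuals, ordered
    from the largest component to the smallest (assumed unstandardized).
    """
    full = d_squared(scores_i, scores_j)
    reduced = d_squared(scores_i[:kept], scores_j[:kept])
    return full - reduced


# Hypothetical component scores for two individuals on three components.
person_i = [1.0, 2.0, 0.1]
person_j = [0.0, 0.0, 0.0]
loss = d_squared_loss(person_i, person_j)
```

A uniformly small loss across pairs supports working in the plane of the first two components; pairs with a large loss are the ones the text suggests setting aside.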

If there are many test variables, e.g. 20 or 25, then the proba- 
bility is low that two principal components can capture enough of 
the total test variation; conversely, if the number of variables is 
small, e.g. five or six, then there is a good chance that most of the 
test variation can be captured by two principal components. For 
example, the first two components of the female form of the SVIB 
(31 scales) might only account for 20 or 30 per cent of the total 
test variation because of the large number of scales and the heter- 
ogeneity of the intercorrelations. On the other hand, the first two 
principal components of the 10 basic scales of the MMPI might 
account for 70 to 80 per cent of the total test variation depending 
upon the sample of people from which the test scores were obtained. 

In general, the usefulness of the technique will depend upon the
number of test variables and the multicollinearities among them.
However, if the first two principal components are the best two 
variables to represent a large number of variables then the re- 
searcher may argue that he will consider these as a means of look- 
ing at the distances between people or groups. This is legitimate if 
the researcher keeps in mind that he cannot generalize from this 
two component configuration to the configuration of the people in 
the original test space. 


Decision Boundaries 


When the people points for two groups are plotted in the
component space, we can visually detect whether there are indeed
two bivariate distributions, one bivariate distribution, or
three or more clusters. It helps to plot the centroids of the two
groups. If they are very close to each other, and the first two
components capture most of the test variation, then we might
conclude that a linear two-group discriminant function analysis in
the test space will probably not significantly differentiate between
the two groups. If, however, the centroids are identical but the
dispersions of the two groups differ, then we may be able to separate
the two groups by a non-linear boundary.

If the two latent roots are of about equal size, then we might
expect a relatively circular bivariate distribution with the centroid
at the origin if the groups did not differ. This is illustrated in
Figure 1. If we have two different groups, then we might expect two
relatively circular distributions with either the same or different
centroids. If the two relatively circular bivariate distributions have
equal dispersion matrices, and the centroids are different, then the
best decision boundary would be the perpendicular bisector of the
line of means. This is illustrated in Figure 2. If the centroids are
quite far apart from each other relative to the dispersions, then a
linear boundary will probably suffice even though the dispersions
may be quite dissimilar. If, on the other hand, the centroids are
equal but the dispersion matrices differ, then a straight line or
linear boundary is not optimum. This is demonstrated in Figure
3. The boundary in this case could be either a circle or an ellipse,
depending on whether or not the magnitudes of the latent roots are
fairly equivalent. Even if the centroids are equal, good
discrimination may be possible if there are gross dispersion
inequalities. If a two-group linear discriminant function were used in
this case, then the difference between the mean discriminant scores
for the two groups would be zero. Also, there may be situations where
both the centroids and dispersions differ. We can then capitalize on
both of these inequalities to define boundaries. One of these
hypothetical situations is illustrated in Figure 4.

Figure 1. No differences in centroids or dispersions.

Figure 2. Centroids different, dispersions equal.
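
For the equal-dispersion case, the perpendicular bisector of the line of means can be expressed as a simple rule in the plane of the first two components. The sketch below is illustrative only; the function name, the centroids, and the points are hypothetical, and a point falling exactly on the boundary is assigned to group B by arbitrary convention.

```python
def bisector_side(point, centroid_a, centroid_b):
    """Classify a point by the perpendicular bisector of the line of means.

    Each argument is an (x, y) pair of scores on the first two principal
    components. Returns "A" if the point lies on centroid A's side.
    """
    px, py = point
    ax, ay = centroid_a
    bx, by = centroid_b
    # Direction of the line of means, pointing from B toward A.
    wx, wy = ax - bx, ay - by
    # Midpoint of the line of means; the boundary passes through it.
    mx, my = (ax + bx) / 2.0, (ay + by) / 2.0
    # Signed projection of the point onto the line of means.
    score = wx * (px - mx) + wy * (py - my)
    return "A" if score > 0 else "B"


# Hypothetical centroids and people points.
group_of_first = bisector_side((1.0, 0.5), (2.0, 0.0), (-2.0, 0.0))
group_of_second = bisector_side((-0.1, 3.0), (2.0, 0.0), (-2.0, 0.0))
```

The rule is equivalent to assigning each person to the nearer centroid, which is appropriate only when the two dispersions are roughly equal and circular; unequal dispersions call for the curved boundaries discussed in the text.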

The important point is that by plotting out the people points in
the principal components space, we can usually ascertain whether
discrimination is possible on the basis of centroid differences,
dispersion differences, or a combination of both.
A second point is that since we are capitalizing so much on
the uniqueness of the data for establishing decision boundaries,
cross validation is definitely in order. In general, it is desirable to
construct the simplest boundary possible. By simple algebraic and
graphical methods, we can easily determine the equation of the
line, lines, ellipses, circles, or parabolas that best discriminate
between the two groups. An attempt should be made to place the
decision boundaries where the densities of the two groups’ bivariate
distributions are approximately the same.


Figure 3. Centroids equal, dispersions different. 


General Considerations 


From the hypothetical situations that were illustrated, it can be 
seen that when the dispersion matrices are unequal, or both the 
dispersion matrices and centroids are unequal, the boundaries are 
non-linear and consequently the interpretation of the data can be 
made in the traditional non-linear manner, e.g. those people scoring 
both high or low on both components are from group A and those 
people scoring average on both components are from group B. The 
interpretation from the linear models when only the centroids be- 
tween groups differ would follow the usual interpretation, e.g. peo- 
ple from group A score high on component 1 and low on component 
2 while those people from group B score low on component 1 and 
high on component 2. For an excellent discussion of the boundaries
covered in this article, as well as other boundaries and other
distribution forms, see a series of articles by Cooper (1962a, 1962b,
1963, and 1964).

Figure 4. Centroids and dispersions both different.


The researcher must be careful in his interpretation. Even though
the first two components might explain, for example, 95 per cent of
the variation, the two groups may show no differences in either
their centroids or their dispersions on those components.
Some relatively small principal components may nevertheless
contain significant information concerning between-group differences
even though they account for a small percentage of the
total test variation. However, especially if the reliability of the
variables is relatively low, the smaller principal components could
largely be a function of error and consequently the group differences
might be of little practical significance.
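
The caution in this paragraph can be illustrated with a minimal sketch. It assumes synthetic data invented for the illustration (the function name `principal_components`, the group sizes, and all values are hypothetical and do not come from the study): two groups share one large dimension and differ only on a small one, so the first component carries nearly all the variance yet shows almost no centroid difference.

```python
import numpy as np

def principal_components(data):
    """Return (proportion of variance per component, component scores), largest first."""
    centered = data - data.mean(axis=0)
    cov = np.cov(centered, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)   # eigenvalues come back in ascending order
    order = np.argsort(eigvals)[::-1]        # reorder so the largest component is first
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    return eigvals / eigvals.sum(), centered @ eigvecs

rng = np.random.default_rng(0)
n = 200  # hypothetical number of subjects per group

# Large dimension: identical values in both groups, so it cannot separate them.
shared = rng.normal(0.0, 5.0, size=n)
large = np.concatenate([shared, shared])
# Small dimension: the only place where the group centroids differ.
small = np.concatenate([rng.normal(-0.5, 0.3, size=n),
                        rng.normal(0.5, 0.3, size=n)])

data = np.column_stack([large, small])
group = np.repeat([0, 1], n)

explained, scores = principal_components(data)
centroid_gap = np.abs(scores[group == 0].mean(axis=0)
                      - scores[group == 1].mean(axis=0))

print("proportion of variance:", np.round(explained, 3))
print("centroid gap by component:", np.round(centroid_gap, 3))
```

In this constructed case the group difference sits almost entirely on the second, small component, which is the situation the paragraph warns about. Whether such a difference is of practical significance still depends on the reliability of the variables.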

It should be pointed out that if the data are distributed in
multivariate normal form and the number of groups and/or subjects
is large, then discriminant analytic techniques would be more
effective and easier to implement. The present technique would
seem to be especially useful for preliminary analyses to gain insight
into the general nature of group differences.


Social desirability has gained considerable attention as a response
set, defined by Cronbach (1946) as a tendency to give
different responses to test items than would be given if the content
was presented in another form. Two aspects of social desirability
have been defined, that of item content and that of response style 
(Christie and Lindauer, 1963; Crowne and Marlowe, 1960; Wig- 
gins, 1962). The former refers to sources of variance in the instru- 
ment whereas the latter refers to stylistic consistencies or sources 
of variance in the subjects. 

Crowne and Marlowe (1960) define social desirability as the 
need to obtain approval by responding in a culturally appropriate 
and acceptable manner. Correlations with the K and L scales of 
the MMPI suggest a quality of defensiveness or attempt to put 
the self in a favorable light. The question may be raised as to 
whether this tendency or need will carry over to self-reports on 
other instruments and the conditions under which the tendency 
will become most manifest. If the Social Desirability Scale (SDS)
measures a tendency to put the self in a favorable light, then
subjects scoring high on the SDS should also tend to rate themselves
more favorably and see fewer discrepancies between themselves
and their ideal selves. Using other measures of SD, Husek
(1961) found that subjects high on social desirability saw them- 


The data analysis was done at Princeton University's Computer Center,
which is supported in part by National Science Foundation Grant NSF-GP579.
selves as intelligent and good leaders and Murstein (1961) found
a relationship between social acquiescence and low self-ideal self 
discrepancies. In a study of the semantic differential, Ford and 
Meisels (1965) found a high correlation between social desirability 
ratings of scales and evaluative factor loadings. They concluded 
that social desirability and the semantic differential tap quite similar
behavioral processes and predicted that those scoring high on
a measure of social desirability would evaluate themselves in a 
positive way on the semantic differential. 

The Ford and Meisels prediction, as yet untested, also suggested
that the SD tendency would become most manifest on specific se- 
mantic differential content—scales having high loadings on the 
evaluative factor. Since response sets are assumed to become most 
manifest where uncertainty is great, one might also expect the SD 
tendency to become most manifest on those scales where the indi- 
vidual is least certain about his judgments. Finally, one might also 
expect the SD tendency to become most manifest on those scales 
considered most important by the subject. 

The research reported here investigated questions relevant to
both subject and instrument variance and tested the following hypotheses:
(1) High scores on the Marlowe-Crowne SDS will be
related to a tendency to judge oneself positively and report few 
self-ideal self discrepancies on the semantic differential. (2) This 
relationship will hold most for scales loading on the evaluative 
factor, for scales rated as important and for scales on which sub- 
jects are uncertain of their ratings. 


Subjects and Procedure 


The Ss were 50 male undergraduates and 50 female undergrad- 
uates. They were administered a semantic differential with the 
standard instructions (Osgood, Suci and Tannenbaum, 1957, p. 
82) followed by the concept MY SELF. The concept was rated on 
13 scales representing five evaluative scales, four activity scales 
and four potency scales. After the Ss made their ratings, they were 
asked to indicate how certain they were of each scale rating. Cer- 
tainty ratings were made on a 4-point scale from very uncertain to 
very certain. Ss then turned the page and rated the concept MY 
IDEAL SELF on the same 13 scales. Following this they were 
asked to indicate, on a separate page, how important each scale 
was as a personality trait. Ratings were made on a 4-point scale
from very unimportant to very important. Finally, the Ss filled
out the Marlowe-Crowne Social Desirability Scale, consisting of 
33 true-false items about the self. 


Results 


The relationship between SD scores and self ratings was tested 
in two ways. First, for each scale a product-moment correlation 
was run between the Ss’ scores on the scale and their SD scores. All 
scale values were scored so that a high score indicated a high 
self rating on the factor and a positive correlation would indicate 
a relationship between SD and high self ratings. Each individual
also received a factor score representing the sum of his self ratings
on the scales relevant to each of the three factors. In a second test
of the relationship between self ratings and social desirability, the 
factor scores were correlated with the SD scores. The scale and 
factor product-moment correlations for males, females and the to- 
tal sample are presented in Table 1. 
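
The two tests described above can be sketched as follows. This is a minimal illustration, assuming made-up ratings and SD scores and a hypothetical scale-to-factor assignment (`FACTOR_SCALES`, `pearson_r`, and `factor_scores` are names introduced here, not taken from the study).

```python
import numpy as np

# Hypothetical assignment of scale columns to factors; the study used 13 scales,
# this toy example uses only 4.
FACTOR_SCALES = {"evaluative": [0, 1], "activity": [2], "potency": [3]}

def pearson_r(x, y):
    """Product-moment correlation between two equal-length score vectors."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xd, yd = x - x.mean(), y - y.mean()
    return float(xd @ yd / np.sqrt((xd @ xd) * (yd @ yd)))

def factor_scores(ratings, factor_scales):
    """Sum each subject's ratings over the scales belonging to each factor."""
    ratings = np.asarray(ratings, dtype=float)
    return {name: ratings[:, cols].sum(axis=1)
            for name, cols in factor_scales.items()}

# Made-up self ratings (rows = subjects, columns = scales) and SD scores.
ratings = np.array([[7, 6, 4, 3],
                    [5, 5, 5, 4],
                    [3, 4, 2, 5],
                    [6, 7, 6, 2],
                    [2, 3, 3, 4]])
sd_scores = np.array([25, 18, 10, 22, 8])

# First test: one correlation per scale.
scale_r = [pearson_r(ratings[:, j], sd_scores) for j in range(ratings.shape[1])]
# Second test: one correlation per factor score.
factor_r = {name: pearson_r(score, sd_scores)
            for name, score in factor_scores(ratings, FACTOR_SCALES).items()}
```

Summing the ratings within a factor presupposes that every scale is keyed in the same direction, which matches the scoring rule stated in the text.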

The factor correlations indicate a significant relationship between 
SD scores and high evaluative ratings, but not for activity or 


TABLE 1
Correlations of Social Desirability Scores with Scale and Factor Self Ratings


                                  Males      Females     Total
Scale                             N = 50     N = 50      N = 100
 1. sociable-unsociable (E)        .11        .27*        .20*
 2. good-bad (E)                   .28*       .17         .23*
 3. active-passive (A)             .98**      .48**       .44**
 4. eager-indifferent (A)          .21        .16         .21*
 5. strong-weak (P)                .18        .15         .14
 6. free-constrained (P)           .16        .98**       .26**
 7. kind-cruel (E)             [illegible]    .18      [illegible]
 8. unselfish-selfish (E)          .97**   [illegible]    .29**
 9. rash-cautious (A)             -.32*    [illegible]   -.24*
10. excitable-calm (A)            -.15       -.17        -.13
11. severe-lenient (P)            -.02       -.04        -.03
12. hard-soft (P)                 -.01        .12      [illegible]
13. wise-foolish (E)               .26        .11         .18
Factors
    Evaluative                 [illegible]    .30*     [illegible]
    Activity                      -.01        .14         .08
    Potency                        .14        .26         .16


*p < .05 two-tailed test
**p < .01 two-tailed test
E = Evaluative; A = Activity; P = Potency

potency ratings. While there is variation in the size of the correlations
for the scales, they follow the same general pattern as was
true for the factors. 

The discrepancy between an individual's self and ideal self rat- 
ings was calculated for each scale and for the three factors, as was 
done on the self ratings. Each individual also received a total
discrepancy score, representing the sum of the discrepancies be- 
tween his self and ideal self ratings on the 13 scales. The product- 
moment correlations between these values and the SD scores are 
presented in Table 2. According to the hypothesis, a high SD 
score should be related to a low discrepancy score, resulting in 
negative correlations. For the entire sample, the sum discrepancy 
scores were significantly correlated with the SD scores (r = -.27,
p < .01). On the three factors, only the relationship between 
evaluative discrepancy scores and SD was significant. In sum, the 
data indicate that high SD scores are related to a tendency to 
judge oneself positively and to report few self-ideal self discrep- 
ancies. Furthermore, this relationship holds for judgments relevant 


TABLE 2
Correlations between Social Desirability and Self-Ideal Self Discrepancy Scores


                                  Males      Females     Total
Scale                             N = 50     N = 50      N = 100
 1. sociable-unsociable (E)        .11       -.26        -.10
 2. good-bad (E)                  -.22       -.14        -.21*
 3. active-passive (A)            -.23       -.28*       -.26**
 4. eager-indifferent (A)          .12        .21         .12
 5. strong-weak (P)               -.08       -.18        -.14
 6. free-constrained (P)          -.04       -.23        -.12
 7. kind-cruel (E)                -.80*      -.02        -.17
 8. unselfish-selfish (E)         -.16       -.28*       -.22*
 9. rash-cautious (A)             -.12       -.22        -.18
10. excitable-calm (A)            -.01       -.09        -.02
11. severe-lenient (P)             .07        .00         .02
12. hard-soft (P)                  .02        .06         .03
13. wise-foolish (E)              -.21       -.18        -.21*
Factors
    Evaluative                    -.98*      -.28*       -.29**
    Activity                   [illegible]   -.22        -.18
    Potency                       -.01       -.17        -.10
    Total                         -.22       -.30*       -.27*


*p < .05 two-tailed test
**p < .01 two-tailed test
E = Evaluative; A = Activity; P = Potency

to an evaluative factor to a far greater degree than for judgments
relevant to activity and potency factors. 
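
The discrepancy scoring described above can be sketched as follows. The source does not say whether discrepancies were signed or absolute, so this minimal example assumes absolute differences; the function name `discrepancy_scores` and all data values are invented for the illustration.

```python
import numpy as np

def discrepancy_scores(self_ratings, ideal_ratings):
    """Per-scale and total self-ideal discrepancies (absolute differences assumed)."""
    self_ratings = np.asarray(self_ratings, dtype=float)
    ideal_ratings = np.asarray(ideal_ratings, dtype=float)
    per_scale = np.abs(self_ratings - ideal_ratings)
    return per_scale, per_scale.sum(axis=1)

# Made-up ratings: rows = subjects, columns = scales.
self_ratings = np.array([[6, 5, 7],
                         [4, 3, 5],
                         [2, 4, 3],
                         [7, 7, 6]])
ideal_ratings = np.array([[7, 6, 7],
                          [7, 6, 7],
                          [7, 7, 6],
                          [7, 7, 7]])
sd_scores = np.array([20, 12, 9, 26])  # made-up SD scores

per_scale, total = discrepancy_scores(self_ratings, ideal_ratings)
# Under the hypothesis, high SD goes with low discrepancy, so r should be negative.
r_total = float(np.corrcoef(total, sd_scores)[0, 1])
```

The same function applied to the columns of a single factor would give the factor discrepancy scores mentioned in the text.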

The second hypothesis suggested that the tendency to report few 
self-ideal self discrepancies would most hold for scales on which 
subjects were uncertain of their judgments and for scales rated as 
important. To test this hypothesis, the total sample of 100 Ss was 
divided into low, medium and high SD scorers. Each subject ob- 
tained a mean self-ideal self discrepancy score for each of the four 
kinds of certainty and importance ratings. A mean discrepancy 
score for each certainty and importance rating was computed for
the Ss in the low, medium and high SD groups. The data indicated 
that the three groups did not differ in the degree to which they 
used the four certainty and importance categories. The second hy- 
pothesis would suggest that self-ideal self discrepancies would de- 
crease with more uncertain and important judgments, and that this 
would hold more for those high on SD than those low on SD. That 
is, the SD tendency would become most manifest on uncertain and 
important judgments. The relevant data are presented in Table 3. 
The certainty data indicate the reverse of the hypothesis. Ss judged 
themselves to be more certain of judgments where low self-ideal 
self discrepancies were reported and this was particularly true for 


TABLE 3
Mean Self-Ideal Self Discrepancy Scores on Different Certainty and Importance
Ratings for Low, Medium and High Scores on Social Desirability (SD)


                                 Mean Discrepancy Score
                      Certainty Ratings            Importance Ratings
                      1      2      3      4       1      2      3      4
                      Low                  High    Low                  High
Low SD, N = 32        1.67   1.68   1.40   1.47    1.02   1.52   1.44   1.49
Medium SD, N = 38     1.82   1.58   1.17   1.21    1.10   1.38   1.24   1.30
High SD, N = 30       1.70   1.53   1.07    .98     .89   1.15   1.07   1.02
Total, N = 100        1.72   1.60   1.21   1.22    1.02   1.35   1.25   1.28
t value for
  low vs. high SD      .13    .68   2.35*  2.00**   .54   1.63   2.13*  2.71**


*p < .05
**p < .01

those high on SD. For the Low SD group there was a nonsignificant
difference between the mean discrepancy scores for low and high
certainty ratings (t = .36, ns) whereas for the High SD group the
mean discrepancy for high certainty ratings was significantly lower
than that for low certainty ratings (t = 3.60, p < .001).

For the importance ratings, it was not generally true that self- 
ideal self discrepancies decreased as scales were rated as more im- 
portant. However, while Low and High SD groups did not differ 
significantly in their mean discrepancy scores for low importance 
ratings, the difference was quite significant and in the predicted 
direction for high importance ratings. 

Another way of looking at these relationships is through the 
use of correlations. The second hypothesis would suggest that for 
high certainty and low importance ratings there would be little 
relationship between SD and self-ideal self discrepancy scores,
while for low certainty and high importance ratings there should 
be a significant negative correlation between SD and self-ideal 
self discrepancy scores. The relevant data are presented in Table 
4. Once more the data clearly indicate that SD scores are most 
related to self-ideal self discrepancy scores for high certainty and 
high importance judgments. The difference in correlations for very 
low and very high ratings was significant for certainty ratings 
(z = 2.13, p < .05) but missed being significant for importance
ratings (z = 1.31, ns).
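
A comparison of two correlations of this kind is commonly made with Fisher's r-to-z transformation. The sketch below is a minimal illustration, assuming the two correlations come from independent samples of 100 subjects each; the source states neither the number of subjects per rating category nor the exact procedure, so the function names and inputs are hypothetical and the result need not reproduce the reported values.

```python
import math

def fisher_z_difference(r1, n1, r2, n2):
    """z statistic for the difference between two independent correlations."""
    z1, z2 = math.atanh(r1), math.atanh(r2)          # Fisher r-to-z transformation
    se = math.sqrt(1.0 / (n1 - 3) + 1.0 / (n2 - 3))  # standard error of z1 - z2
    return (z1 - z2) / se

def two_tailed_p(z):
    """Two-tailed probability of |z| under the standard normal distribution."""
    return math.erfc(abs(z) / math.sqrt(2.0))

# Certainty correlations for the lowest and highest rating categories in Table 4,
# with N = 100 assumed for both (an assumption made for this illustration).
z_certainty = fisher_z_difference(0.00, 100, -0.29, 100)
p_certainty = two_tailed_p(z_certainty)
print(round(z_certainty, 2), round(p_certainty, 3))
```

Because the same subjects contribute to both correlations, treating them as independent is a simplification; a test for dependent correlations would be the stricter choice.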

The data indicate a clear relationship between social desirability 
as measured by the Marlowe-Crowne SDS, self ratings on the 
semantic differential and self-ideal self discrepancies on the se- 
mantic differential. In terms of subject variance, the data support 
the conclusion of Christie and Lindauer (1963) that the SDS is 
getting at a meaningful behavioral dimension. In terms of instru- 
ment variance, the data indicate that ratings on the semantic dif- 
ferential can be considerably influenced by a social desirability 
factor. However, the correlations were not so great as to support 
the Ford and Meisels conclusion that the two reflect similar judg- 
mental processes or that “the social desirability variable in self- 
descriptive statements may be treated as a special case of this 
more general dimension” (1965, p. 472). 

TABLE 4
Correlations between Social Desirability and Self-Ideal Self Discrepancy for
Different Certainty and Importance Ratings. N = 100


Certainty and                      Correlations
Importance Rating           Certainty      Importance
Low     1                     .00            -.04
        2                    -.11            -.21*
        3                    -.11            -.23*
High    4                    -.29**          -.23*
        1 & 2                -.14            -.19
        3 & 4             [illegible]**      -.30**


*p < .05 two-tailed test
**p < .01 two-tailed test


The data support Moscovici’s (1963) conviction that one must 
consider the question of content with regard to which the SD tend- 
ency manifests itself. The results reported here would suggest that 
the SD tendency becomes particularly critical on adjectives or 
scales having a high evaluative loading. While this study did not 
represent a direct test of Cronbach’s (1946) hypothesis that response
sets have the greatest influence on scores in uncertain or
ambiguous situations, contrary to the hypothesis the data indicate
a relationship between the influence of SD and a feeling of certainty
about the judgment made. It may be that Ss defensively
report themselves as certain of judgments on which there is a small 
self-ideal self discrepancy. The data in relation to importance rat- 
ings would suggest that people high on SD tend to report small 
self-ideal self discrepancies on scales they experience as important 
or that they judge as important scales on which they experience a 
small self-ideal self discrepancy. 

These results are important for all investigations that involve 
subjects rating themselves and their ideal selves. In her review of
the self concept literature, Wylie (1961) commented upon the need
to specify the conditions under which the SD tendency will distort
self and ideal self ratings. The present study contributes toward
greater clarity in this area. In studies of psychological health using 
these ratings a small self-ideal self discrepancy may either reflect 
health or a strong defensive process. Particular attention should be 
given to the SD factor when ratings are made on evaluative type 
adjectives, on important scales and when they are reported to have 
been made with great certainty. On the other hand, the data also 
suggest that the semantic differential and Marlowe-Crowne instruments
do not measure the same thing. The SD factor influences
self and ideal self ratings but rarely can it be equated with them. 
In sum, studies using self and ideal self ratings might profitably 
use the SDS as a measure of defensiveness, somewhat in the same 
way that the K and L scales are used on the MMPI or that
Wallach, Green, Lipsitt and Minehart (1962) used a measure of 
defensiveness to explain contradictions between overt and projec- 
tive measures of personality. 


Summary 


This study tested two hypotheses: (1) High scores on the
Marlowe-Crowne SDS will be related to a tendency to judge oneself
positively and report few self-ideal self discrepancies on the
semantic differential. (2) This relationship will hold most for adjective
scales loading on the evaluative factor, for scales rated as
important and for scales on which Ss are uncertain of their ratings. 
High SD scores were found to be significantly related to high self 
judgments and low self-ideal self discrepancies on scales high on 
the evaluative dimension, but not on scales high on the potency 
and activity dimensions. High SD scores were found to be most 
related to low self-ideal self discrepancies for high certainty and 
high importance ratings. The study indicated the need to consider 
the content on which the SD tendency manifests itself and the 
possible effects of defensiveness on self and ideal self judgments. 


THE EFFECTS OF INSTRUCTIONS AND ITEM
CONTENT ON THREE TYPES OF RATINGS¹


HUGO A. BORRESEN²


s SPE - 
Vienna, Virginia 


Reports of self and others are important in psychological re- 
search because they provide data which are otherwise difficult to 
obtain. These reports have become more valuable as knowledge of 
the factors which affect them has increased. Among the many in- 
fluences on subjects using self-reports are the instructions given to 
them and the rating instrument they use. 

The instructions given to subjects using an evaluation instrument
might affect their responses. When instructed to do so, participants
can simulate being masculine or feminine (Kelly, Miles,
and Terman, 1936) or well-adjusted (Kimber, 1947; Noll, 1951).
They may alter their responses in relation to the time and methods
of feeding back results (Angell, 1949; Pressey, 1950; Smode, 1958). 
They can change their responses with their purposes (Davids, 
1955; Heron, 1956). 

The test items in a rating instrument might also affect the sub- 
jects’ responses. Both the social desirability of an item (Edwards, 
1957; Rosen, 1956) and the threshold of emotional words (Bruner 
and Postman, 1947) may influence raters. 

Bronfenbrenner (1958) states that most people will describe 
themselves favorably on questionnaires or rating scales and will 
estimate that others’ self-ratings will be favorable. The person 
who does not assume he is similar to others will rate himself unfavorably
but will predict that others will rate themselves favorably.


1 Submitted in partial fulfillment of requirements for the Ed. D. degree.
2 System Development Corporation performed the analysis of variance on
a computer as part of its policy of assistance to employees.


Purpose

The purpose of this study, in exploring such behavior in greater
detail, was (1) to determine how written instructions affect 
ratings of and by participants; (2) to determine how the contents 
of the items used in the ratings affect the responses; and (3) to 
determine some characteristics of those who rated themselves dif- 
ferently from the majority. 


Method 


The subjects received instructions containing three variables.
They rated themselves, rated others, and estimated how others 
rated them. 

The instructions contained three variables: the purpose given
the subjects for making the ratings, the amount of subjective judgment
the subjects were asked to use, and the promise of being
given the results of the ratings (feedback) made of them by others.
This promise was made to half the participants but was not mentioned
to the rest.

The stated purpose contained one of three degrees of personal 
threat. Twelve judges ranked from most to least threatening the 
three purposes (here paraphrased) : 


To evaluate candidates for teaching credentials to determine 
who will receive university recommendation 
To permit as a course requirement the instructor to become 
better acquainted with each individual 
To develop written instructions for the rating device to be 
used with future classes 


For the subjective judgment, the judges ranked these instructions 
(also paraphrased) from most to least subjective: 


Use personal judgment, even if the impression is difficult to 
describe verbally 

Recognize personal feelings, but be certain that the item de- 
scribes actual, observed behavior 

Rate only definitely observed behavior 


Each subject received an instruction sheet containing one of the 
three purposes, one of the three degrees of subjective judgment, 
and, in 50 percent of the cases, a promise of feedback of others’
ratings. 
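
The three instruction variables cross to give 3 x 3 x 2 = 18 possible instruction sheets. A minimal sketch of that enumeration follows; the short labels are paraphrases introduced here for illustration, and the source does not describe how sheets were assigned to subjects.

```python
from itertools import product

# Hypothetical short labels for the levels of each instruction variable.
PURPOSES = ("credential evaluation",
            "instructor acquaintance",
            "instruction development")
JUDGMENT = ("personal judgment",
            "feelings plus observed behavior",
            "observed behavior only")
FEEDBACK = (True, False)  # feedback promised, or not mentioned

# Every combination of purpose, degree of subjective judgment, and feedback.
conditions = list(product(PURPOSES, JUDGMENT, FEEDBACK))

for purpose, judgment, feedback in conditions:
    print(purpose, "|", judgment, "|", "feedback" if feedback else "no mention")
```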

The rating device used for the evaluations contained 44 items 
greatly modified from Carter's (1954) factor analysis. The indi- 
vidual items described behavior on two dimensions: favorable-un- 
favorable and observable-unobservable. A favorable item de- 
scribed behavior that was flattering or similar to the subject’s own 
behavior; an unfavorable item described behavior which was un- 
flattering, passive, or was regarded by the rater as dissimilar to his 
own. An observable item described behavior that could probably 
be viewed in the testing situation; an unobservable item described
behavior that could probably not be observed in the testing situation.
These examples illustrate each type:


1. Who was unusually adept at making others feel at ease?
(favorable and observable)

2. Who talked a good deal without saying anything? (unfavorable
and observable)

3. Who seemed to prefer to let others make the plans and deci- 
sions? (unfavorable and observable) 

4. Who probably agrees with you regarding what is wrong with 
the schools? (favorable and unobservable) 

5. Whose moral standards are probably quite different from 
yours? (unfavorable and unobservable) 


The items were developed to measure five behaviors: leadership 
(item 1, above), disapproval (item 2), passivity (item 3), similar- 
ity (item 4), and dissimilarity (item 5). 


Procedure 


Upper division students from two classes in Educational Psy- 
chology in a western university met for two class periods to discuss 
a case study of a junior high school student. These candidates for
teaching credentials were randomly placed in groups of seven to 
facilitate discussions. Their purpose was to complete a three-part 
assignment which included writing questions raised by the discus- 
sion, suggesting solutions which could be used in a public school, 
and listing principles of guidance learned in the course.

In the third class period, each participant was given three items: 
the 44-item rating device; an answer sheet containing his name and
the names of the other members of his group; and an instruction
sheet that contained one of the three purposes, asked for one of the
three degrees of subjective judgment, and either promised feedback
or made no mention of this. They were asked to rate themselves,
rate others, and estimate how others rated them. 

After unsatisfactory responses (incomplete, omission of estimates,
etc.) were eliminated, 70 subjects furnished the data for the
statistical analysis. The effects of the instructions were tested by
analysis of variance; the ratings of others and estimates of others’
ratings were correlated and the mean ratings of selected raters were
compared.


Results


1. The instructions did not affect the ratings. Fewer responses 
were significant than expected by chance alone. Those which were 
significant showed no systematic influence by the three variables 
on the three types of ratings. 

2. The behavior described by the items had a definite effect on 
the ratings. Most of the favorable items, whether observable or 
unobservable, elicited more responses than did unfavorable items 
on three types of ratings, and these ratings were highly correlated. 
Thus, subjects responded more frequently to the favorable items; 
they rated themselves, rated others, and estimated others’ ratings 
very similarly, supporting Bronfenbrenner. 


TABLE 1
Number of Favorable Items Ranked Highest, by Numbers of Raters Using Them,
Ratings Made, and Mean Ratings Made for Three Types of Ratings


                                          Types of Ratings
                                      Self     Others    Estimate
Favorable items in rating device       10*       19         19
Number of items having highest
    number of raters                   10        16         18
    number of ratings                  10        16         19
    mean ratings                       10        17         17


*Nine of the 19 favorable items did not permit self-ratings, e.g., Who has subject matter interests
similar to your own?

Furthermore, more favorable than unfavorable items were significantly
correlated for all three types of ratings.


TABLE 2
Number of Significant Correlations for Favorable and Unfavorable Items by
Type of Ratings


                              Self-       Self-        Other-
                              Other       Estimate     Estimate
Favorable      Possible        10          10            19
               Actual           7           9            16
               Per cent        70%         90%           84%
Unfavorable    Possible        16          16            25
               Actual           2           4            10
               Per cent        12.5%       25%           40%
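
The per cent rows of Table 2 follow directly from the actual and possible counts. A minimal sketch of that arithmetic is given here; the counts are copied from the table, while the dictionary layout and the function name `percent_significant` are introduced for the illustration.

```python
# (actual, possible) counts of significant correlations, copied from Table 2.
COUNTS = {
    "favorable": {"self-other": (7, 10),
                  "self-estimate": (9, 10),
                  "other-estimate": (16, 19)},
    "unfavorable": {"self-other": (2, 16),
                    "self-estimate": (4, 16),
                    "other-estimate": (10, 25)},
}

def percent_significant(actual, possible):
    """Share of possible correlations that reached significance, in per cent."""
    return 100.0 * actual / possible

percentages = {
    kind: {pair: percent_significant(actual, possible)
           for pair, (actual, possible) in pairs.items()}
    for kind, pairs in COUNTS.items()
}

for kind, pairs in percentages.items():
    for pair, value in pairs.items():
        print(f"{kind:12s} {pair:15s} {value:5.1f}%")
```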


On unobservable items, the ratings of others were significantly cor- 
related with the estimates of others’ ratings. The correlations were 
higher for the favorable than for the unfavorable items. 

The intercorrelations of ratings of others with estimates of 
others’ ratings were: 


Item            Correlated      With an item which was
favorable       positively      favorable
unfavorable     positively      unfavorable
favorable       negatively      unfavorable


3. Individual differences were evident in responses to the test
items. Twenty-one subjects indicated that one or more of the unfavorable
items described them. Their peers did not agree with
seven subjects on eight items, suggesting they underestimated
themselves. For five subjects, peers agreed with them on some but
not on others of eleven described behaviors, implying that they
only partially underestimated themselves. For the remaining nine
subjects, fellow raters agreed that they exhibited the self-described
unfavorable behaviors. The mean ratings received by all twenty- 
one subjects on these unfavorable items were not significantly 
higher than those received by other subjects. 

TABLE 3
Number of Items on which Other Raters Concur with the Twenty-one Subjects
and Mean Ratings Received by Them on All Items


                                        Concurrence
                                                            Total/
                              None    Partial   Complete   Average
Number of subjects              7        5         9         21
Number of items                 8       11        14         33
Mean ratings received on
    favorable items*          33.45    35.25     33.51      34.07
Mean ratings received on
    unfavorable items*        11.18    14.76      6.33      10.67


*The subjects in the rest of the sample received a mean rating of 39.33 on favorable items and
6.86 on unfavorable items.

However, the 21 subjects rated others favorably, supporting Bronfenbrenner.
They were described by their peers as lacking certain
observable verbal and social skills:


Failing to present information or suggestions (passive)
Talking a good deal without saying anything (disapproval)
Trying to dominate the discussion (disapproval)
Having annoying mannerisms (disapproval)
Seeming insensitive to the feelings of others (disapproval)
Talking about their own experiences and ideas to the annoyance
of others (disapproval)
Making others feel tense, anxious, or hostile (disapproval)
Having points of view seeming to make them much older or
younger (dissimilar)


They were also described as lacking certain unobservable verbal
and social skills:


Probably having taste in clothing, cars, and entertainment
dissimilar to those of the rater (dissimilar)
Choosing friends who would probably not appeal to the rater
(dissimilar)


Discussion


It is not clear why the instructions had no influence while the
individual items did. Perhaps the items but not the instructions
threatened the subjects. The social desirability of the items was
more important than written instructions in affecting the responses
of these subjects.

Responses to the social desirability of the items generally support
Bronfenbrenner’s contention that subjects would be favorable
toward themselves and others. This was also true when judgments
were made of others’ behavior that could not be directly observed.

Some of the deficient verbal and social skills of those who felt
they showed unfavorable behavior were identified. But these critical
behaviors were limited to the items in the rating device, so the 
contents of the items may need expansion in future research. The 
results suggest that a screening device can identify those who may 
need assistance in dealing with group situations. Students who un- 
derestimate themselves can be identified, and those lacking in ver- 
bal and social skills can be urged to receive counseling. The use of 
self-ratings may become more important in teacher-training institutions
as well as in other organizations which must evaluate individuals.


REFERENCES 


Angell, George W. The Effect of Immediate Knowledge of Quiz 
Results on Final Examination Scores in Freshman Chemistry. 
Journal of Educational Research, 1949, 42, 391-394. 

Bronfenbrenner, Urie. The Study of Identification Through Inter- 
personal Perception. In Renato Tagiuri and Luigi Petrullo, 
(Eds.), Person Perception and Interpersonal Behavior, Stan- 
ford: Stanford University Press, 1958. 

Bruner, Jerome S. and Postman, Leo. Emotional Selectivity in Perception
and Reaction. Journal of Personality, 1947, 16, 69-77.

Carter, Launor F. Evaluating the Performance of Individuals as
Members of Small Groups. Personnel Psychology, 1954, 7, 477-


Davids, Anthony. Relations Among Several Objective Measures of 
Anxiety Under Different Conditions of Motivation. Journal of 
Consulting Psychology, 1955, 19, 275-279.

Edwards, Allen L. The Social Desirability Variable in Personality 
Assessment and Research. New York: Dryden Press, 1957. 

Heron, Alastair. The Effects of Real-Life Motivation on Question- 
naire Response. Journal of Applied Psychology, 1956, 40, 65-68. 

Kelly, E. L., Miles, C. C., and Terman, L. M. Ability to Influence 
One’s Score on a Typical Pencil-and-Paper Test of Personality.
Journal of Personality, 1936, 4, 206-215.

Kimber, J. A. Morris. The Insight of College Students Into the 
Items on a Personality Test. EDUCATIONAL AND PSYCHOLOGICAL 
MEASUREMENT, 1947, 7, 411-420. 

Noll, Victor H. Simulation by College Students of a Prescribed Pattern
on a Personality Scale. EDUCATIONAL AND PSYCHOLOGICAL
MEASUREMENT, 1951, 11, 478-488.




Pressey, Sidney L. Development and Appraisal of Devices Pro- 
viding Immediate Automatic Scoring of Objective Tests and 
Concomitant Self-Instruction. Journal of Psychology, 1950, 29, 
417-447. 

Rosen, Ephraim. Self-Appraisal, Personal Desirability, and Perceived
Social Desirability of Personality Traits. Journal of Abnormal
and Social Psychology, 1956, 52, 151-158.

Smode, Alfred F. Learning and Performance in a Tracking Task 
Under Two Levels of Achievement Information Feedback. 
Journal of Experimental Psychology, 1958, 56, 297-304. 


EDUCATIONAL AND PSYCHOLOGICAL MEASUREMENT 
1967, 27, 863-869. 


THE CONNOTATIVE MEANING OF THE CONCEPT 
“PUBLIC SCHOOL TEACHERS”: AN IMAGE ANALYSIS
OF SEMANTIC DIFFERENTIAL DATA 


ALTHOUGH the assessment of the meaning of concepts and attitudes
has often been studied (e.g., Merton, 1957; Newcomb, 1958),
it is still difficult to quantify word meaning. The Semantic Differential
is a simple and systematic procedure for quantifying the
connotative meaning of concepts (Osgood, Suci, and Tannenbaum,
1957). It offers an original approach to the old but important 
problem of measuring the meaning of words. 

In the present study one facet of this important problem of
measuring the meaning of words was investigated. The connotative
meaning among college students of the term “Public School
Teachers” was studied by the Semantic Differential. In two other
studies (Husek and Wittrock, 1962; Wittrock, 1964), the Semantic
Differential technique was used to determine the factors of connotative
meaning for the concepts “School Teachers” and “Public
School Teachers.” In the latter study, six factors were found with
a heterogeneous population. The factors were: Expressiveness,
Tenacity, Stability, Potency, Predictability, and Evaluation. The
conclusion of these studies was that the factor structure depends
upon the heterogeneity of the population of subjects as well as
upon the concepts they are rating and the scales they use in the
rating.
The intent of the present study was to determine if other variables,
such as method of analysis and variety of concepts rated,
also affect the factor structure of responses from raters.


Method 
Subjects 


The subjects of the study were 425 college students, all from one 
university. There were 178 freshman and 90 sophomore students 
enrolled in beginning, required, English classes. There were 157 
students officially enrolled in the teacher education curriculum; 43 
of them were juniors and 114 were fifth year students. None of the 
subjects was a volunteer. 


Materials 


The test booklet given to the students consisted of three pages of 
directions, each followed by two pages of 7-point, bipolar scales. 
The undefined concept label “Public School Teachers” appeared at 
the top of each of two pages. A total of 29 bipolar scales were 
presented below the concept “Public School Teachers.”

The first page of the booklet presented directions very similar to 
those suggested by Osgood et al. (1957) for a Semantic Differ- 
ential. The directions also specified that not more than a few sec- 
onds be spent marking each scale. On the first two pages of scales, 
each subject was asked to indicate what he associated with the 
particular concept at the top of the page. For half of the booklets 
the second page of instructions asked him to indicate what he 
thought professors of history associate with the concept given at 
the top of the page. The other half of the booklets substituted 
professors of mathematics for professors of history. The third page 
of instructions was identical to the second page of instructions 
except that the subject was asked to indicate what he thought pro- 
fessors of education associate with the concepts. 


Procedure 


The Semantic Differentials were administered during regularly
scheduled class periods. The subjects were told that they were participating
in a study on the connotative meaning of the words
“Public School Teachers.” They were then given the test booklet
and were asked to read the directions and to complete the test
booklet. All the subjects were given sufficient time to complete the
booklet.


Results and Discussion 


The 87 X 425 data matrix was analyzed using an image model 
(Guttman, 1953) as modified by Harris (1962). The image model 
is an advantageous one because it supplies scale invariant factors, 
and the factor scores are computable rather than just estimable. 

The criterion used to determine the number of factors was Gutt- 
man's strong lower bound, equal to the number of positive latent 
roots of the correlation matrix with the squared multiple correla- 
tions in the diagonal. Fifty-one image factors, as determined by 
this criterion, were rotated using the normal varimax orthogonal 
rotation procedure. This produced an 87 X 51 rotated image factor 
matrix. This matrix contains the covariances of the image vari- 
ables (those parts of the original variables predictable from the 
86 remaining variables) with the image factors. Another 87 X 51 
matrix produced by the analysis contains the correlations of the 
original variables with the image factors. Before rotation these 
matrices differ only by a scale factor. In the tables which follow, 
the factor loadings will be the correlations of the original variables
with the factors. Of the total of 51 factors, only 13 are discussed
below; the other 38 factors were discarded. Each accounted
for approximately 1 per cent of the total variance and many were 
difficult to interpret. It should be noted that image analysis pro- 
duces unique factors since it is not an ordinary factor model. 
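
The factor-count criterion described above can be sketched in a few lines. This is only an illustration under stated assumptions, not the program used in the study: the function names are hypothetical, the squared multiple correlations are taken from the inverse of a nonsingular correlation matrix, and the demonstration data are simulated.

```python
import numpy as np


def squared_multiple_correlations(R):
    """SMC of each variable with all the others: 1 - 1 / diag(R^-1)."""
    return 1.0 - 1.0 / np.diag(np.linalg.inv(R))


def count_positive_roots(R, tol=1e-10):
    """Count positive latent roots of R with SMCs replacing the unit diagonal."""
    reduced = np.array(R, dtype=float)  # copy, so the caller's R is untouched
    np.fill_diagonal(reduced, squared_multiple_correlations(reduced))
    roots = np.linalg.eigvalsh(reduced)  # symmetric matrix, ascending roots
    return int(np.sum(roots > tol)), roots


# Simulated (hypothetical) data: 200 raters, 6 scales, 2 common factors.
rng = np.random.default_rng(0)
common = rng.standard_normal((200, 2))
loadings = np.array([[.8, 0], [.7, 0], [.6, 0], [0, .8], [0, .7], [0, .6]])
scores = common @ loadings.T + 0.6 * rng.standard_normal((200, 6))
R = np.corrcoef(scores, rowvar=False)

n_factors, roots = count_positive_roots(R)
print(n_factors, np.round(roots, 3))
```

With sample data the count is often well above the number of substantively interpretable factors, which is in keeping with the large number of factors reported here.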

In Table 1, the first three factors loaded highly on almost all of 
the 28 scales corresponding to each of the three roles, Self, Liberal 
Arts Professor, and Education Professor, which the students were 
asked to assume. Because of lack of space, in Table 1 only the
scales with the three highest loadings on each respective factor 
are presented. The students were able to discriminate among these 
three roles, which together accounted for 29 per cent of the total 
variance. In Tables 2 through 7 below, all scales loading .30 or 
greater on the respective factors are presented. The factors 4 
through 13 are grouped to resemble the factor structure found in 
the earlier study (Wittrock, 1964) which used the same subjects 
but only the data from the Self role. 

Table 2 lists two factors, Kindness and Fatigue, which together 
TABLE 1
Roles

Role*   Scale                       Loading

Factor 1: Self Role
S       Interesting-Boring           -.77
S       Wise-Foolish                 -.78
S       Sharp-Dull                   -.78

Factor 2: Liberal Arts Professor Role
L       Strong-Weak                   .81
L       Wise-Foolish                  .79
L       Sharp-Dull                    .77

Factor 3: Education Professor Role
E       Interesting-Boring            .82
E       Colorful-Colorless            .81
E       Sharp-Dull                    .80

* S = Self Role; L = Liberal Arts Professor Role; E = Education Professor Role


TABLE 2
Expressiveness

Role    Scale                       Loading

Factor 4: Kindness
S       Kind-Cruel                   -.31
L       Kind-Cruel                   -.80
L       Sociable-Unsociable          -.76
L       Warm-Cold                    -.34

Factor 5: Fatigue
S       Refreshed-Weary               .65
L       Refreshed-Weary               [illegible]
E       Refreshed-Weary               .39

TABLE 3
Tenacity

Role    Scale                       Loading

Factor 6: Tenacity
S       Severe-Lenient                .81
S       Tenacious-Yielding            [illegible]
S       Hard-Soft                     .57
L       Strong-Weak                   .37
L       Severe-Lenient                .37
L       Tenacious-Yielding            .34
E       Severe-Lenient                .88
E       Tenacious-Yielding            .33

TABLE 4
Stability

Role    Scale                       Loading

Factor 7: Competitiveness
S       Competitive-Cooperative       .60
L       Competitive-Cooperative       .60
E       Competitive-Cooperative       .80
E       Excitable-Calm                .32

Factor 8: Calmness
S       Excitable-Calm               -.78
L       Excitable-Calm               -.79
E       Excitable-Calm               -.53


resemble the factor Expressiveness identified in the earlier study. 
Tenacity, the second factor identified in the earlier study, remained 
largely intact through the image analysis of the data for all three 
roles. See Table 3. But Stability, the third factor in the earlier 
study, was broken into at least two factors, Competitiveness and
Calmness. 

As seen in Table 5, Potency remained largely intact, but the 
scale Masculine-Feminine emerged across all three roles as a separate
factor. Predictability divided into three factors, Sanity, Rationality,
and Stability; while Evaluation divided into a large
number of factors. Two of these factors, Goodness and Optimism, 
each accounted for about 2 per cent of the total variance. They 
are listed in Table 7. 

Several points emerge from the above comparison of the princi- 


TABLE 5
Potency

Role    Scale                       Loading

Factor 9: Bravery
S       Rugged-Delicate              -.73
S       Brave-Cowardly               -.38
S       Strong-Weak                  -.25
L       Rugged-Delicate              -.55
L       Brave-Cowardly               -.35
E       Rugged-Delicate              -.50
E       Brave-Cowardly               -.88

Factor 10: Masculinity
S       Masculine-Feminine           -.66
L       Masculine-Feminine           -.67
E       Masculine-Feminine           -.81

TABLE 6
Predictability

Role    Scale                       Loading

Factor 11: Sanity
S       Sane-Insane                  -.59
L       Sane-Insane                  -.70
E       Sane-Insane                  -.68

Factor 12: Rationality
S       Rational-Intuitive            .75
L       Rational-Intuitive            .39
E       Rational-Intuitive            .46

Factor 13: Stability
L       Stable-Changeable            -.79
L       Rational-Intuitive           -.37


pal components analysis of the Self role data with the image anal- 
ysis of the data for all three roles. First, the image analysis of the 
data for all three roles indicated that the three separate roles
were assumed by the subjects. Second, the total number of factors,
51, from the image analysis was much greater than the total number
of factors from the principal components analysis. Many of
the principal components factors split into two or more factors in
the image analysis. For example, Evaluation splintered into many
factors in the image analysis. Third, with the obvious exception of
the three role factors, most of the factors cut across the three roles,
Self, Liberal Arts Professor, and Education Professor. Factor 12,
Goodness, is an exception.

In two earlier studies (Husek and Wittrock, 1962; Wittrock, 
1964) the factor structure of the data varied with the heterogeneity 
of the population of subjects and with the variety of scales. From 
the above comparison of the two analyses of data, the factor structure
depends also upon the type of analysis, principal components
or image analysis, as well as the variety of concepts rated in a
study. The image analysis gave considerably more factors than
the principal components analysis, but the difference in number of
factors may be due also to the greater variety of concepts used in
the data analyzed by the image method.


TABLE 7
Evaluation

Role    Scale                       Loading

Factor 14: Goodness
E       Goodness-Badness              .50
E       [illegible]-Incomplete        .43

Factor 15: Optimism
S       Optimistic-Pessimistic       -.48
E       Optimistic-Pessimistic       -.81


Factors which contain only one scale, repeated over roles, should 
be interpreted with caution since the 3 X 29 design on the variables
will tend to induce patterns of correlation which make this
likely. 

From the two earlier studies and from the present study, one 
conclusion seems to follow. The Semantic Differential should be 
used as a technique rather than as a test. The factor structure 
obtained in a semantic differential study depends upon several 
variables: the variety of concepts, the variety and number of scales, 
the heterogeneity of the population of subjects, and the type of 
analysis performed on the data. 
EDITOR’S COMMENT: A NOTE OF APPRECIATION


For nearly fourteen years this editor has had the pleasure and 
privilege of managing the BOOK REVIEWS of this journal. During
this extended tour of duty he has met and worked with numerous 
reviewers whose efforts and contributions he wishes to acknowl- 
edge. Their cooperation and conscientious efforts have made pos- 
sible the presentation of four sections of reviews each year. In the 
editor's opinion the quality of workmanship of these contributors 
has added much to the value of the journal and to the dissemination 
of evaluative information concerning volumes that touch upon im- 
portant problem areas in psychology, education, and sociology. To 
all who have participated (book reviewers, publishers, and other
members of the editorial staff, including Mrs. Geraldine Thomas
and my wife, Joan), many thanks are extended.

The publication explosion has hit this journal as it has all others. 
The volume of manuscripts submitted to the VALIDITY STUDIES
Section has increased by more than 600 per cent during the past
year relative to what it was only four years ago. The responsibili- 
ties of this augmented workload have made it necessary for the 
editor to relinquish the assignment involving BOOK REVIEWS. 
Beginning with the spring, 1968 issue the new editors of this sec- 
tion will be Dr. Max D. Engelhart of Duke University, Durham, 
North Carolina and Dr. Henry Moughamian of Chicago City Col- 
lege. This editor is pleased that persons of such high professional 
competence will manage the BOOK REVIEWS. Every best wish 
is extended to Max and Henry for their making the book reviews of 
the future even more meaningful and helpful to the audience of 
EDUCATIONAL AND PSYCHOLOGICAL MEASUREMENT. 

Because of the large increase in the numbers of books being 
published it has been decided that for the most part only books 
directly concerned with educational and psychological measure- 
ment and evaluation, statistics, research design, and computer tech- 
nology will be reviewed. Of course, any reviewers who have 
already prepared evaluative comments on books which may not 
fall into the categories just cited will have their reviews pub- 
lished. The introduction of this specialized emphasis may be ex- 
pected to result in somewhat more penetrating reviews than may 
have been written in the past. In any event, comments from readers 
of this journal will be welcome, for it is they whom we all wish to 
please. 

WILLIAM B. MICHAEL 
Editor 
Scales for the Measurement of Attitude by Marvin E. Shaw and
Jack M. Wright. New York: McGraw-Hill Book Co., 1967.
Pp. xxii + 604. $14.50.


This book—the first of its kind—is long overdue and welcome 
to the field of attitude measurement. In large measure, it performs 
the valuable service of providing a convenient literature survey 
of 176 of the better attitude inventories. The book, however, is 
more than a catalogue. Chapter one, for example, sets the stage by 
attempting to integrate traditional conceptualizations of attitude. 
This exposition is highlighted by a rejection of the orthodox conception
of attitude consisting of three components, cognitive, affective,
and behavioral, in favor of the view that only affect constitutes
the attitude, per se. Cognition, on the other hand, serves
as the direct means by which attitudes are modified, and behavior 
as the means by which attitudes find expression. One potential 
value of the unyoking of these components is to make one less 
compelled to assume an isomorphism of attitude and behavior. 

Chapter two is a summary review of old and new methods of 
attitude scale construction, serving as an updated and condensed 
version of Green’s (1954) presentation. Though, in general, the 
Green chapter remains the better source for the neophyte, the 
present discussion serves well as a briefing and/or refresher course 
for one in the practical business of measuring attitudes. 

The major substance of the volume lies within Chapters three 
through ten, where the attitude scales themselves are presented. 
Each of these eight chapters represents a separate broad category 
of social objects; viz., social practices, social issues and problems, 
international issues, abstract concepts, political and religious atti- 
tudes, ethnic and national groups, significant others, and social 
institutions. Each chapter includes anywhere from 15 to 34 scales, 
mostly Thurstone or Likert, and relevant to an array of subcate- 
gories. For example, in Chapter three, attitude inventories related 
to such social practices as family relationships, education, religion, 
heterosexuality, health, and economics, are reported. Exhibition of
specific attitude scales serves a further breakdown into more or 
less specific attitudinal referents. For example, within the subcate- 
gory of family relationships one finds measures of attitudes about 
discipline of children, self-reliance, parents giving sex information 
to children between the ages of 6 and 12, and related topics. Nor
is a particular attitude limited to a single representative scale, par- 
ticularly for those which have been more thoroughly investigated, 
perhaps associated with the most salient social issues (e.g., five
scales of attitude toward Negroes are included).

Copies of all scales are presented in full, including self-admin- 
istering instructions and items, so that any can be lifted intact for 
use. Pertinent supplementary information about each scale is also 
given, such as a brief history of its development and use, size and 
type of subject population used, response mode, scoring proce- 
dure, reliability and validity information, and a critique. 

The final chapter in the book is devoted to an overall evaluation
of the scales. The tenor of this evaluation is one of criticism. It is
pointed out, for instance, that there have been few major break- 
throughs in technique of attitude assessment since Thurstone and 
Likert, and that even the newer techniques which are available are 
rarely, if ever, used. Test constructors are further criticized for 
inattentiveness to such important considerations as whether the 
judges' own attitudes influence favorableness judgments of items, 
and whether the items are monotone or nonmonotone. The au- 
thors also show a healthy regard for reliability, validity, unidimen- 
sionality, equality of units, and zero-points, indicating concern 
over the lack of evidence thereof. 

However, in spite of these seemingly overwhelming deficiencies, 
the authors do admit that most of the scales they report are re- 
liable and “satisfactory for most research purposes.” The defi- 
ciencies, they point out, apply mostly to considerations of 
individual assessment.

Finally, a number of helpful suggestions are offered to ameliorate
the situation. Some of these are reminders of good practices
typically covered by test construction manuals. Others are less
routine, such as the recheckin 
Improving Experimental Design and Statistical Analysis by Julian
C. Stanley (Editor). Chicago: Rand-McNally and Co., Inc., 
1967. Pp. ix + 308. 


The Seventh Annual Phi Delta Kappa Symposium on Educa- 
tional Research was conducted at the University of Wisconsin in 
August of 1965. The five papers presented, and a sixth by Donald 
Campbell, which was added later, have been published as Improv- 
ing Experimental Design and Statistical Analysis. In addition to 
the paper by Campbell, papers by Julian C. Stanley, George E. P. 
Box, Ingram Olkin, Leslie D. McLean, and Frank B. Baker are 
contained in this volume. In addition, some 14 discussants partici- 
pated in the symposium. The participants represent an extremely 
high-powered collection of some of the most able and creative 
members of the research community. It is not surprising, then, 
that the papers are almost without exception of high quality; nor 
is it surprising that there is considerable variety with respect to
both the content and the provocativeness of the individual papers. 

The interchange of ideas which occurred during the (closed) 
discussions makes for some of the most stimulating reading about
research that can be found in print anywhere. 

The lead paper in the symposium was Stanley’s survey of the 
articles published in the 1964 American Educational Research 
Journal (in which “by generous interpretation” there were 12 true 
or quasi-experiments among the 26 articles) and a renewal of his 
call for better training of educational researchers in experimental 
design. In connection with the latter, he described the University 
of Wisconsin’s Laboratory of Experimental Design (LED), which 
is intended to turn out specialists in experimental design and anal- 
ysis, and discussed methods of recruiting able graduate students 
and upgrading current researchers and graduate students. 

A major portion of the discussion of Stanley’s paper dealt with 
the problems of training persons to have not only quantitative 
and methodological sophistication (which LED graduates most as- 
suredly do have) but also sufficient knowledge of the substantive 
areas to be able to apply their training to the problems of educa- 
tion. There was a tendency for the discussants to divide into two 
groups—those affiliated with LED defending that program against 
the suggestions by the others that such a program might neglect 
the substantive areas of education in a manner which would in- 
hibit their abilities to make important contributions to the im- 
provement of education. 

Although Box’s “Bayesian Approaches to some Bothersome 
Problems in Data Analysis” is an excellent presentation of, and 
illustration of, the Bayesian way of thinking, the paper fails to 
relate this approach to the kinds of problems typically encountered 
by the educational researcher. To illustrate this point, consider
that his discussion of normality and of transformations was il- 
lustrated with biological data, while the section on non-stationary 
time series emphasized economic and industrial applications and
was illustrated with artificial data. Although the typical educational 
researcher will find much of this paper rather technical, he should 
at least be aware of the kinds of possibilities the Bayesian approach 
presents. 

As Schutz noted (p. 101), what followed Box’s presentation was
“more . . . a dialogue than a discussion.” A recurrent theme
throughout this discussion was that frequently the Bayesian and
classical approaches to data analysis both involve the same com- 
putations (when an “indifference” distribution is used as a prior), 
but that the justification for the analysis is different. At one point, 
Box noted (p. 83) that “Fisher’s outlook is very Bayesian. . . . Sav- 
age expressed it very well when he said that Fisher tried to make 
the Bayesian omelet without breaking the Bayesian eggs. . . . 
Fisher is really halfway to Bayes anyway, but he did not accept 
the use of diffuse prior distributions and introduced the fiducial 
idea to avoid them.” Although this discussion was basically a dia- 
logue, it was a most informative dialogue. Every educational re- 
searcher should at least be exposed to Bayesian analysis, and there 
is probably no better source for this exposure than G. E. P. Box. 
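
The point that Bayesian and classical analyses can involve the same computations under an "indifference" prior can be illustrated with the simplest textbook case. The sketch below is not taken from Box's paper; it assumes normal observations with a known standard deviation and a flat prior on the mean, and the function names and data are hypothetical.

```python
import math
from statistics import NormalDist, fmean


def classical_interval(data, sigma, confidence=0.95):
    """Classical confidence interval for a normal mean with known sigma."""
    crit = NormalDist().inv_cdf(0.5 + confidence / 2.0)
    half_width = crit * sigma / math.sqrt(len(data))
    center = fmean(data)
    return center - half_width, center + half_width


def flat_prior_interval(data, sigma, confidence=0.95):
    """Central posterior interval for the mean under a flat prior.

    With a flat prior the posterior is Normal(mean(data), sigma / sqrt(n)).
    """
    posterior = NormalDist(mu=fmean(data), sigma=sigma / math.sqrt(len(data)))
    tail = (1.0 - confidence) / 2.0
    return posterior.inv_cdf(tail), posterior.inv_cdf(1.0 - tail)


scores = [52.0, 47.5, 55.1, 49.8, 50.6, 53.3]  # hypothetical observations
print(classical_interval(scores, sigma=3.0))
print(flat_prior_interval(scores, sigma=3.0))
```

The two intervals agree numerically; what differs is the justification, a statement about the procedure in one case and about the parameter in the other.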

With the emphasis in the symposium on experimentation (“controlled,
variable-manipulating experiments”), it is gratifying to find
Ingram Olkin talking about correlation. What Olkin has done is to


review some of the procedures and problems in tests of hy- 
potheses and estimation for one and for more than one correla- 
tion coefficient. Of particular concern are some vintage prob- 
lems of correlational analysis which have been reconsidered 
recently and upon which we can shed some new light. An
attempt is made to consider alternative models for some of 
these problems. (p. 103). 


He considered such problems as estimating $\rho$ from $r$, and establishing
confidence intervals for $\rho$; testing hypotheses of the form
$\rho_{01} = \rho_{02}$, where $X_0$, $X_1$, and $X_2$ are three correlated variables
(for which resorting to Fisher's $z$ is generally inappropriate!); and establishing
confidence limits for $\rho_{12} - \rho_{34}$, where for example $X_1$ and
$X_2$ are pretests (at time $t_1$) of two attributes and $X_3$ and $X_4$ are
posttests (at time $t_2$) of these same attributes.
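
For the simplest of these problems, an interval for a single correlation, the familiar textbook device is Fisher's z transformation. The sketch below shows that standard approximation only; it is not Olkin's procedure, the function name is hypothetical, and it does not apply to comparisons of correlations computed on the same sample, which is the case the review flags.

```python
import math
from statistics import NormalDist


def fisher_z_interval(r, n, confidence=0.95):
    """Approximate confidence interval for rho from a sample r on n cases.

    Uses z = atanh(r) with standard error 1 / sqrt(n - 3), then maps the
    limits back to the correlation scale with tanh.
    """
    if not -1.0 < r < 1.0:
        raise ValueError("r must lie strictly between -1 and 1")
    if n <= 3:
        raise ValueError("n must exceed 3")
    z = math.atanh(r)
    standard_error = 1.0 / math.sqrt(n - 3)
    crit = NormalDist().inv_cdf(0.5 + confidence / 2.0)
    return (math.tanh(z - crit * standard_error),
            math.tanh(z + crit * standard_error))


low, high = fisher_z_interval(r=0.50, n=103)  # hypothetical values
print(round(low, 3), round(high, 3))
```

The interval is asymmetric about r on the correlation scale, since the symmetry holds only on the z scale.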

The discussion of this paper began with an emphasis on considering
the underlying model or models when considering any
analysis. As Olkin noted, an advantage of this approach
[illegible] the underlying assumptions. This
concern with “the model” was one of the two major themes of the
discussion, the other being the problem of communication between
researchers and statisticians. One way to ease this problem, the 
participants emphasized, is to provide the researcher with an adequate
statistical training, so that he can present his problems to the
statistician in a meaningful fashion and understand the latter’s com- 
ments or suggestions. 

McLean’s paper returned to considerations of “experimenta- 
tion,” as he discussed the uses of incomplete designs (balanced 
incomplete blocks, Youden squares, Latin squares, fractional fac- 
torials), considering principles of balance and orthogonality and 
emphasizing the “design decisions” rather than details of analysis. 
The presentation provides a valuable reference for researchers, 
and its availability should lead to an increase in the frequency 
with which such designs are employed. 

Baker’s paper, “Experimental Design Considerations Associated 
with Large Scale Research Projects,” is probably the least stimu- 
lating of the collection, but even it provided a springboard for 
some lively discussion. In calling for a new class of experimental 
designs, Baker emphasized the need to take into account the lack 
of careful definitions of problems, lack of a univariate criterion 
variable, involvement of an interdisciplinary team of researchers, 
the fact that projects deal with problem areas rather than specific 
problems, and the existence of a “management hierarchy” which 
needs research guidelines. All of these characteristics of the cur- 
rent research milieu result at least in part from the growth of 
large-scale research projects, heavily influenced by the availability 
of federal funds in large quantities. 

Contributions to this “new class” of designs are envisioned as 
coming from such various sources as PERT, the Program Evalua- 
tion and Review Technique used so successfully in the develop- 
ment of the Polaris weapons system; the “response surface 
designs” developed by Box; and operations research, characterized 
by a systems orientation, the use of interdisciplinary teams, and
the adaptation of the scientific method. (The omission of Bayesian 
statistics, which in their application to decision-making seem par- 
ticularly appropriate to a programmatic kind of research, is 
surprising.) 

Campbell’s paper is basically a plea to educational administrators 
to “do research.” That is, whenever an administrative change is 
introduced, records should be kept and “nonreactive measures” 
taken to determine the effects of the change. Campbell states his 
theme (p. 261) as follows: 


If in borrowing (from the experimental social sciences) one 
needs to cross-validate with administrative experiments any- 
way, why don’t we just start experimenting where we live? If 
we are doing experiments with samples of an administrative
principle, we are thus providing a basis of experimentation that
is close to home, where the extrapolation is small, and where
the likelihood of valid extrapolation is great. 


He calls for greater use of the institutional records already being 
kept to reveal the effects of changes, warns about various potential 
sources of invalidity (cf. Campbell and Stanley, 1963), and dis- 
cusses several specific ways in which his suggestions can be car- 
ried out. 

In summary, this book contains a wealth of information, and 
even more speculative discussion, which should stimulate any educational
researcher. Bayesian analysis, research training, correlational
analysis, incomplete experimental designs, programmatic
research, and administrative experimentation are all discussed and 
many methodological and philosophical ideas are presented. With 
an exhaustive list of references and an adequate index, Improving 
Experimental Design and Statistical Analysis is one of the very 
best of the published Phi Delta Kappa Symposia, and a valuable 
addition to the libraries not only of educational researchers but
also of all who are concerned with the methods and philosophies
of the behavioral sciences.
It is rare to find useful novelties in the presentation of an approach
to [illegible] statistics. In the apparently insatiable market
for statistics texts, which has produced (and probably will
[illegible]) a host of not very idiosyncratic permutations
on a less than original theme, anything that disturbs the
[illegible] out immediately. The work of Games and Klare
[illegible] a volume that will be worth the while of most
instructors to examine, and many students to own, particularly
those students who find statistics overwhelming. In fact, two of
the worst features of the book are the printing format (which
looks archaic) and the introduction (which tends to the egocentric).
This must be one of the poorest printing productions that
the publishers have sponsored, even allowing for technical difficulties
in the demands made by the authors. And the authors did
not really need to go so far in “selling” their product.

But these faults can be overlooked in view of the many positive
aspects of the book. The content is in no way remarkable: it is
the usual first-course material, going as far as tests for the statistical
significance of differences between means. It is in its format and
in its style that the book is notable.
The method of writing is an interesting experiment in itself.
By using short sentences and a redundant style, the book approaches
a programmed-instructional format. In effect, the material
is presented several times, each time in a different way, and
always with as little reliance on mathematics as the authors can
manage, so that even the most numerically naive student should,
with a little application, grasp the “point” of the material. It is to
this clarity of presentation, the novel and relaxed style (though
the humorous asides begin to pall after a while) that the book
owes much of its worth. The authors may have made a tactical
mistake in addressing the book in the first person to “sophomores
and juniors.” It is a useful first-course book at whatever level, and
one could conceive of more academically advanced students being
left out of much of the flippancy. The instructor, by the way, is
addressed through footnotes, and there is also a handbook available
for him (though not for review).
Although the book is aimed primarily at students who are weak
in mathematics, it is not a watered-down text conceptually. There
is an appendix that attempts to set out the “required and optional
skills” for using the book, but it is not a particularly profound one.
In fact, a student who was in as much need of mathematical help
as the appendix implies would be better off with other source
material. It is also possible that the more capable student would
become bored with the overly patient explanations, but he would
not be short-changed by the book's substance. Verbally, the quality
is high; numerically, the quality is not so demanding as other
texts on the market. The exercises, for example, are tedious rather
than difficult. Whether this imbalance is good, or bad, is a matter
for individual evaluation.

The unusual feature of this book takes the form of fold-out 
sheets preceding each chapter, and constituting one appendix. The 
sheets are so folded that when they are fully extended, relevant
material projects beyond the body of the book and is thus available 
for constant reference. For instance, a given sheet presents the 
symbols used in the following chapter, and the definitions, so
that it is possible for the reader to keep in sight an explanation for 
each symbol. The appeal and advantages of this kind of device are 
obvious, especially for the more frequently-used tables, which can 
be referred to without fumbling through a maze of appended 
material. The fold-out sheets also have (on the obscured portion, 
usually) the mathematical relationships (formulae, equations) per- 
tinent to its next adjacent chapter, and on the back of each sheet, 
review problems for the preceding chapter. Again, the problems 
will show when the book is closed: some answers (about half) are 
given, and are printed on the covered portion. If the description 
of this device is confusing, the reader should feel assured that in 
practice the effect of the fold-out sheet is quite the opposite. 

There is one notational usage of interest, too. An asterisk (*) is 
used to indicate multiplication: the analogy to FORTRAN is ob- 
vious. Otherwise, the symbols used are quite standard, and the 
student should have no difficulty in making transitions to and from 
other statistics texts. 

There are five major divisions to the text. Part A is a general 
orientation, a general vocabulary session, and also contains a section 
on levels of measurement (which raises some potentially interesting 
opposition to Stevens' nomenclature). Part B scarcely 
needs comment: it contains the usual fare of descriptive statistics— 
frequency distributions, percentiles, measures of central tendency, 
measures of variability, and standard scores. Part C covers the 
essentials of statistical inferences: probability, hypothesis testing, 
t tests. The whole section is remarkably well done considering the 
avoidance of mathematical symbolism. Distributional properties 
are approached in a thoroughly enlightening fashion; point- and 
interval-estimations are also well done. The exposition has been 
ably assisted by the frequent use of diagrams and charts. t-testing 
has been reduced to the mechanical level. Cookbook directions 
and examples are provided for those wishing to test the statistical 
significance of the difference between means. In view of the authors' 
ingenuity elsewhere, the reliance on the blind-flying approach 
is a little surprising. 

Part D, “bivariate descriptive statistics,” covers the Pearson 
product-moment correlation coefficient and stabs at linear regression. 
The underlying rationale to correlation is explained (verbally 
[illegible]) at length, and is supplemented by [illegible] 
illustrative figures and tables, and some excellent nomographs. 
With such a good start, the avoidance of other correlation 
coefficients than simple r is a disappointment. Part E concludes the 
[illegible] of the text with a section on "experimental design." 
The authors do little but grasp the opportunity to draw together 
and reiterate the points that have already been made, but which 
are of especial concern in setting up an experimental design. The 
added (new) portions are good: attention is drawn to several misusages 
of statistics and semantic traps that beset the experimenter, and 
useful vocabulary is established. Thus, discussion centers around 
such topics as the difference between “significant” and “impor- 
tant,” dependent vs. independent measures, matched vs. independ- 
ent groups, experiments vs. investigations. 

The appendices contain a great deal that would, in other books, 
be included in the main body of material. Topics such as chi- 
square, the F test, and simple analysis of variance are relegated to 
this area of the text. ANOVA is poorly treated in the effort for 
simplicity, and in fact some unjustifiable practices are encouraged, 
such as “the usual procedure is to conduct t tests on all possible 
pairs of means” after a significant F has been found. Effusively 
cookbook directions given for computations, with the sudden ap- 
pearance of previously unexplained vocabulary, make this section 
one of the weakest in the book. Although the purpose of the 
appendix is to present the less elementary material in recipe-like 
form, it generally fails to impart any substantial understanding of 
the meaning and purpose of these statistics. 
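
The objection to running t tests on all possible pairs of means can be made concrete with a little arithmetic. The sketch below is not taken from the book under review. It assumes, purely for illustration, that each pairwise test is run at the .05 level and that the tests are independent (in practice the comparisons share data, so the figures are only approximate).

```python
from math import comb


def n_pairwise_comparisons(k: int) -> int:
    """Number of distinct pairs that can be formed from k group means."""
    return comb(k, 2)


def familywise_error(k: int, alpha: float = 0.05) -> float:
    """Approximate chance of at least one false rejection over all pairs.

    Assumption (illustrative only): the pairwise tests are independent.
    """
    m = n_pairwise_comparisons(k)
    return 1.0 - (1.0 - alpha) ** m


# Tabulate the inflation for a few hypothetical numbers of groups.
for k in (3, 4, 5, 6):
    print(k, n_pairwise_comparisons(k), round(familywise_error(k), 3))
```

Under these assumptions five groups yield ten comparisons, and the chance of at least one spurious "significant" difference is roughly .40 rather than .05.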

The volume, then, has its ups and downs. For those who are 
afraid of the numerical aspect of statistics there should be ample, 
clean, verbal explanation to enable them to appreciate and to use 
the vocabulary of statistics. The worksheets should also help these 
people through the basic computations and interpretations. The 
fold-out sheet has been most effectively employed, and the il- 
lustrations and graphical material are excellent. Although the au- 
thors have tried to maintain a simple and relaxed style, their verbal 
clowning at times becomes foolish. And their simplicity has on 
rare occasion led to too naive an exposition. On the whole this is a 
book that will be attractive to more people than it offends. 


PETER A. TAYLOR AND SUSAN E. FORD 
Rutgers University 


Educational Statistics: Use and Interpretation by W. James Pop- 
ham. New York: Harper and Row, 1967. Pp. x + 429. 


The author states that “this book was written in reaction against 
my own experience as a student in educational statistics classes. It 
was evident that many of my fellow students in those classes were 
essentially going through the motions necessary to pass the course 
but were acquiring little real understanding of the statistical con- 
cepts which were treated. Subsequent conversations with num- 
erous educators from all parts of the country confirmed my 
opinion that only a small collection of professional educators possess 
real confidence regarding their ability to use and interpret 
statistical methods.” (p. ix). 

As a solution, this book is offered in which “the emphasis . . . 
will be upon the common-sense rationale underlying the statistical 
methods described. It is hoped that by learning how a statistical 
technique operates, the reader will be able to understand the pur- 
pose of the technique and in what types of educational situations 
it should be used. Whenever possible, statistical concepts . . . are 
described in a verbal or even graphic fashion. Mathematical ex- 
planatory techniques are reduced to an absolute minimum. No 
derivations of formula are included.” (p. 3). 

It is probably true that professional educators either have had no 
training in statistical techniques at all, or have been subjected to a 
course which has sacrificed the basic understanding of the student 
by attempting to cover too much in too short a time, usually 
depending upon a mathematical sophistication in the student which 
in fact does not exist. The reaction by the student to this latter 
experience is the subsequent avoidance of anything resembling 
measurement, once the dissertation is complete, which itself is an 
agony of imitating the competent statistician. 

The author's attempt to correct this situation is a worthy ambition; 
the question remains whether or not he has succeeded. Generally 
speaking, this book does make a number of valuable contributions 
to the avowed purpose, but at the same time it is marked 
by some flaws that seriously detract from the success. Some of the 
flaws are predictable from the nature of the task the author has set 
for himself and from the small size of the text: his highly ver- 
balized exposition must be superficial and sketchy on some topics 
of importance, and other points that should be included are simply 
omitted. 

In his introductory chapter Popham describes measurement in 
terms the educator is likely to encounter, discusses the general 
uses of statistical methods, and quite properly cautions the reader 
about the hazards of practical decision-making from the results of 
a statistically significant experiment. “There is a crucial difference 
between a statistically significant result and a practically significant 
result. While knowledge of statistics is an invaluable asset to 
the educator, it is his proper responsibility to be guided by statisti- 
cal results; he should not be led by them.” (p. 6). 

Chapter Two presents a brief summary of descriptive statistics. 
The variance given is the unbiased estimate of the population 
variance, a usage that is consistent throughout the text. A 
"blocked out bar graph" presentation of the standard deviation is a 
heuristic device that is a fairly successful graphical explanation. 
Happily, grouped data techniques are ignored. 
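
The distinction between the unbiased estimate and the descriptive (divide-by-N) variance can be shown with a small worked case. The scores below are hypothetical and are not drawn from Popham's text; the sketch only illustrates the N − 1 divisor that the review refers to.

```python
import statistics

scores = [4, 7, 6, 3, 10]  # hypothetical sample, not from the book

n = len(scores)
mean = sum(scores) / n
sum_sq = sum((x - mean) ** 2 for x in scores)  # sum of squared deviations

biased = sum_sq / n          # descriptive variance, divides by N
unbiased = sum_sq / (n - 1)  # unbiased estimate of population variance

# The standard library draws the same distinction.
assert abs(statistics.pvariance(scores) - biased) < 1e-12
assert abs(statistics.variance(scores) - unbiased) < 1e-12

print(biased, unbiased)
```

For so small a sample the two figures differ noticeably (6.0 against 7.5), which is why a text needs to use one convention consistently.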

The following chapter on the normal curve is terse and competent, 
if not inspired. In Chapter Four, the reader is initiated into 
the logic of inferential statistics. Probability theory is not treated 
formally, but the author does touch on such essential topics as the 
null hypothesis, the definition of a variable, populations and sam- 
ples, random numbers, significance levels, types of error, power of 
a test, one and two tailed tests, and point and interval estimation— 
all within 24 short pages, including problems and a selected bib- 
liography. 

The next group of Chapters, 5 through 17, come in pairs, one 
chapter for the exposition and the following one for numerical 
examples. The chapters on correlation deal primarily with prod- 
uct-moment, partial, and multiple correlations. Several helpful 
graphical illustrations are provided. Formulas are given for multi- 
ple correlation with two predictors, and for first and second order 
partial correlations. Confidence intervals for r are discussed in 
terms of the Fisher Z transformation. The appropriate use of the 
special correlations (phi, biserial, point biserial, and tetrachoric) is 
discussed, but no formulas are given. 
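
For readers who want to see what a confidence interval built on the Fisher Z transformation looks like in practice, a minimal sketch follows. It uses the standard textbook form (z = arctanh r, standard error 1/sqrt(N − 3)) rather than Popham's own worked example, and the values r = .50 and N = 28 are hypothetical.

```python
from math import atanh, sqrt, tanh
from statistics import NormalDist


def fisher_ci(r: float, n: int, confidence: float = 0.95):
    """Confidence interval for a correlation via the Fisher Z transformation."""
    z = atanh(r)                 # transform r to the Z scale
    se = 1.0 / sqrt(n - 3)       # standard error on the Z scale
    crit = NormalDist().inv_cdf(0.5 + confidence / 2.0)  # normal critical value
    # Build the interval on the Z scale, then transform back to r.
    return tanh(z - crit * se), tanh(z + crit * se)


low, high = fisher_ci(0.50, 28)  # hypothetical r and N
print(round(low, 2), round(high, 2))
```

The resulting interval is not symmetric about r, which is the point of working on the transformed scale.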

Simple linear regression is well presented in Chapters 7 and 8. 
By contrast, the solution of a two-predictor multiple regression 
equation in Chapter 8 requires no little algebraic facility on the 
student’s part. Perhaps only a discussion of multiple regression and 
an example of its use would have been sufficient. 

After reading Chapter 9 and working through the examples of 
Chapter 10 the student should know what t test to apply under a 
variety of conditions—no mean accomplishment—and in general 
the exposition is clear and straightforward. 

Analysis of variance is discussed in the next four chapters with 
a lack of rigor that is excusable only by the author’s avowed goal 
of emphasizing the common-sense approach. However, the suc- 
cessful student should be able to compute simple and multiple 
classification problems, and the graphical explanations may enhance 
the student’s feel for what is going on in analysis of variance. 
Quite unfortunately, however, the student will be led badly 
astray by the subsequent discussion of interaction. 

As an illustration of interaction, a four by four table of mean 
scores is presented (p. 196) that consistently decline in value from 
left to right and from top to bottom. What this shows, of course, is 
consistent main effects for both variables and no interaction whatsoever. 
The difference between the A values for each condition of 
B is exactly the same, and vice versa; interaction would have to be 
illustrated by a table showing the difference between the A values 
changing as a function of the B values, or vice versa, or perhaps 
reversed A values as a function of the B values, as the author 
presents so clearly in a graph on the following page. The accompanying 
text indicates that this is no typographical error, but a 
conceptual one. The potential user of the book is cautioned against 
permitting students to gain their “common-sense” understanding 
from this unfortunate passage. 
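
The reviewers' point can be checked numerically. The two four by four tables below are hypothetical and are not the figures printed on p. 196. The first declines steadily across rows and columns with constant differences, and the second lets the row differences change and finally reverse. The sketch computes the usual interaction residual for each cell (cell mean minus row mean minus column mean plus grand mean).

```python
def interaction_residuals(table):
    """Cell mean - row mean - column mean + grand mean, for every cell."""
    n_rows, n_cols = len(table), len(table[0])
    grand = sum(sum(row) for row in table) / (n_rows * n_cols)
    row_means = [sum(row) / n_cols for row in table]
    col_means = [sum(table[i][j] for i in range(n_rows)) / n_rows
                 for j in range(n_cols)]
    return [[table[i][j] - row_means[i] - col_means[j] + grand
             for j in range(n_cols)]
            for i in range(n_rows)]


def has_interaction(table, tol=1e-9):
    """True if any cell departs from the purely additive pattern."""
    return any(abs(v) > tol
               for row in interaction_residuals(table) for v in row)


# Hypothetical means: steady decline left to right and top to bottom.
additive = [
    [40, 35, 30, 25],
    [36, 31, 26, 21],
    [32, 27, 22, 17],
    [28, 23, 18, 13],
]

# Hypothetical means: the left-to-right trend flattens, then reverses.
crossed = [
    [40, 35, 30, 25],
    [36, 33, 30, 27],
    [32, 31, 30, 29],
    [28, 29, 30, 31],
]

print(has_interaction(additive), has_interaction(crossed))
```

Every residual in the first table is zero, so it displays two main effects and nothing else; only the second table shows the changing differences that define an interaction.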

Although the student, with sufficient patience, might be able to 
compute the relatively complex analyses of covariance described, 
these chapters are unsatisfactory in several respects. No graphic 
explanations are offered. The assumption that the values of the 
covariates should be unaffected by the treatment variables is not 
emphasized. The complexity of the designs discussed here is not 
consonant with the level of material presented elsewhere in the 
text, and the student is likely to lose his grasp of the generalities 
while grappling with the details. Perhaps a better strategy would 
have been a complete discussion of the single classification with 
one covariate design, with illustrations, rather than cookbook solu- 
tions of a one-way design with two covariates, followed by a 
two-way design with three covariates. 

In spite of its brevity, Chapter 17 should give the student some 
notion about the rationale of factor analysis, as well as the 
terminology. 

Chapters 18 and 19 on non-parametric statistics constitute prob- 
ably the most satisfactory section in the book. The techniques 
discussed include Chi-square, Wilcoxon Matched-Pairs Signed- 
Ranks Test, Mann-Whitney U Test, Sign Test, Friedman Two- 
Way Analysis of Variance by Ranks Test, Kruskal-Wallis One- 
Way Analysis of Variance, and the Spearman Rank Correlation 
Coefficient. Appropriate tables for all of these tests are included 
in the Appendix. One error occurs in these chapters, specifically in 
the exposition of the chi-square goodness-of-fit test for normality. 
The degrees of freedom are presented as being equal to the num- 
ber of categories minus one, but in fact they should equal the 
number of categories minus one degree for each of the constants 
used in the fitting process. In the example, these are the sample 
size, the standard deviation, and the mean; therefore the degrees 
of freedom should be the number of categories minus three. 
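
The degrees-of-freedom rule described above reduces to simple counting, sketched here. The function name and the choice of eight categories are hypothetical; only the counting rule (one degree lost per constant taken from the sample) comes from the review.

```python
def gof_degrees_of_freedom(n_categories: int, n_constraints: int) -> int:
    """Degrees of freedom for a chi-square goodness-of-fit test.

    One degree is lost for each constant taken from the sample in order
    to compute the expected frequencies.
    """
    df = n_categories - n_constraints
    if df < 1:
        raise ValueError("not enough categories for the constraints imposed")
    return df


K = 8  # hypothetical number of score categories (not from the book)

# Rule as presented in the book: only the total frequency is fixed.
df_book = gof_degrees_of_freedom(K, n_constraints=1)

# Reviewers' rule for a fitted normal curve: sample size, mean, and
# standard deviation are all taken from the data.
df_reviewers = gof_degrees_of_freedom(K, n_constraints=3)

print(df_book, df_reviewers)
```

With eight categories the two rules give 7 and 5 degrees of freedom; at the .05 level the tabled critical value of chi-square falls from 14.07 to 11.07, so the choice can change the verdict on normality.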

The final chapter, Choosing the Appropriate Technique, is an 
excellent innovation for this type of text. Presented in branching 
self-instructional format, this chapter should increase the student’s 
basic conceptual understanding of when and why the various sta- 
tistical techniques should be used, a point that is often lost to the 
student among the details of a more thorough text. 

A further contribution to pedagogy is the presence of an abun- 
dance of simple and pertinent examples and exercises at the end of 
each chapter, and even answers to the discussion questions are 
supplied at the end of the text. The availability of computers 
makes it increasingly inappropriate to ask students to spend long 
periods of time working through large-N examples of each new 
technique. Far more learning is likely to take place from small-N 
examples and problems which can be grasped as a whole, leaving 
the repetitious calculations to the computer. For the most part 
this text follows this philosophy, with the noted exceptions in the 
chapters on covariance and multiple regression. 

Because so much of educational research concerns tests of one 
kind or another, one bit of knowledge that the professional educa- 
tor will need is at least an elementary knowledge about the concept 
of reliability, and in fact, the author presents in his introductory 
chapter a split-half reliability coefficient as an example of the sort 
of thing which might bewilder the untrained. But that is the last 
time the split-half or any other kind of reliability is encountered 
in the text, save for a mention that two supposedly equivalent 
forms of a test can be correlated as a check on their equivalence. 
A few pages on reliability would have helped fill this large gap in 
the educator’s ability to interpret research. 

At the conclusion of a course utilizing this book as its text, the 
student with little training in mathematics should have an elemen- 
tary understanding of the several statistical techniques discussed. 
His ability to interpret the research literature should be improved 
and his diffidence about statistical manipulations should be less 
than if he had been subjected to a course requiring a more rigorous 
background. It would seem that the aims of the text have been 
realized. With the very important qualifications noted above, the 
text could be used successfully with the terminal bachelor’s or 
master's candidate who is otherwise not likely to acquire these 
necessary concepts. 


WILLIAM M. STALLINGS AND RONALD L. FLAUGHER 
University of Illinois, Champaign 


Probability and Statistics in Psychological Research and Theory, 
by Donald W. Stilson. San Francisco: Holden-Day, 1966. Pp. 
xii + 507. 


It is tempting to label Stilson’s book a “text for a descriptive 
course in mathematical statistics” because of its broad but essen- 
tially shallow treatment of its subject matter. But to do so would 
both slight this very interesting book and bypass a question about 
the teaching of educational and psychological statistics courses that 
demands an immediate and searching answer. 

The book itself is described by the author as appropriate for a 
second course in statistics for advanced undergraduates or beginning 
graduate students, but previous courses in statistics or training 
in mathematics beyond algebra are not presumed necessary. Stilson 
has three explicit purposes: (1) to cover the fundamentals of probability 
and statistics; (2) to cover these topics especially as they 
apply to psychological research; (3) to give the psychologist some 
command of the statistician’s language. The second goal is pursued 
by including in most chapters a mathematical formulation of some 
psychological problem and following it up with some “intuitive 
clarification.” These sections, particularly those concerned with 
learning theory, are rather good and should help keep student 
interest from flagging. 

The organization of the book is more like that of the usual 
mathematical statistics text than that of a psychological statistics 
book. The introductory chapter discusses the aims and purposes of 
statistics from several different points of view. The second chapter 
gives a short history of probability and the statistical method up to 
the development of decision theory. Chapter three introduces set 
notation and some basic theory and develops some ideas of popula- 
tions and samples in these terms. The fourth chapter is a very 
thoroughgoing treatment of the fundamental concepts of prob- 
ability. Much of the exposition is in terms of sample spaces and set 
concepts. Probability calculations are treated in the fifth chapter 
together with the law of large numbers and Bayes’ theorem. Chap- 
ter six has as its purpose “. . . to develop an efficient language for 
talking about probability in connection with variables.” It is a rather 
good formalization of many of the topics usually reserved for a 
course in “measurement.” The basic ideas of probability distributions 
and density functions, including multivariate distributions, 
are presented in chapter seven. Most teachers will find themselves 
teaching more prerequisite mathematics than anything else here. 
Expected values, their estimation, and some general properties of 
estimators are developed in chapter eight. Chapter nine reaches the 
topics so many applied texts start with, descriptive statistics. Central 
tendency, variability, skewness, correlation, degrees of freedom 
(including a geometric interpretation that loses the forest in look- 
ing for the trees) are all discussed in this chapter. Chapter ten 
details some of the more common and some of the more useful 
[probability] distributions and density functions. These include the 
[binomial,] hypergeometric, multinomial, and Poisson; the rectangular, 
[illegible]. The eleventh chapter presents the ideas 
of [sampling] distributions. The evaluation of the properties of point 
[estimators] and the extension of the idea of estimation to intervals 
[are presented in chapter] twelve. The thirteenth and last chapter 
[illegible] testing and classification procedures. All three 
[illegible] statistics, description, estimation, and hypothesis 
[testing, illegible] and in a logical and reasonable order. It should 
[illegible] contains many sets of problems, most of 
[illegible] proximity to the sections to which they pertain. 
Each chapter has a list of suggested readings at the end, and 
these almost without exception are excellent. 

There are two levels at which this text should be evaluated. 
First, how well does Stilson accomplish what he sets out to do? 
If we accept his statement of goals literally, probably rather well. 
The fundamentals of probability and statistics are set forth well 
and clearly. The psychologist’s interest in these areas is well il- 
lustrated and documented. The student who uses this text will 
have some command of the statistician’s language and perhaps 
even thought patterns. But at another level, the question that Stil- 
son’s approach raises is the old but increasingly important one of 
what will be taught in applied statistics courses. 

What psychologists have done in the past and what they still do 
in the main is to teach techniques. They show students procedures 
that apply to certain limited classes of problems and then drill them 
in the computational details of the techniques, in rote memoriza- 
tion of the assumptions underlying the techniques, and in their 
limitations, i.e., how far can one push them without becoming 
ridiculous? Stilson, on the other hand, together with an increasing 
number of applied statistics teachers, attempts to establish an un- 
derstanding of the mathematical and statistical qua statistical think- 
ing and rationale that forms the basis of probabilistic description, 
estimation, and hypothesis testing. 

The reasons for favoring Stilson’s approach are almost too ob- 
vious to mention. But they boil down to the fact that a student 
who understands what statistics is about is not limited to a few 
techniques that he has mastered, or, more importantly, to a few 
problems or to a relatively delimited problem the results of which 
he has learned to analyze. 

Why not, then, embrace Stilson’s book, his viewpoint, the whole 
approach? The answer to that question is one any statistician loves: 
an interaction. Given the mathematical preparation the majority of 
social and behavioral science students bring to their graduate work, 
Stilson’s approach is not feasible. An hour a day five days a week 
for a year would not suffice to get through the text. Much of the 
lecture time would go to teaching mathematics and mathematical 
procedures some students lacked. At the end of the course, most 
students would still not know what statistics was about. Learning 
both the requisite mathematics and some statistics would have 
provided too much material at too fast a pace. At the other end of 
the interaction, however, the well-prepared or very able student 
might find Stilson a truly educational and enlightening experience. 
He might well learn to think like a modern statistician and to talk 
statistics with a statistician. But suppose he took his degree and 
went someplace where there was no statistician? This is not unlikely, 
since there are only about 5000 qualified mathematical statisticians 
in the world. What would he, much less the not-so-able 
student, do in that situation when faced with the need for a tech- 
nique with which to analyze the results of an experiment? If he 
was lucky and the problem was a common one, any of the usual 
techniques-oriented texts might be sufficient. If the problem were 
at all out of the ordinary, he might go to advanced texts in mathe- 
matical statistics and he might even be able to read them. He 
would find most of these books enlightening on the statistical 
theory relevant to the problem, but of no use whatever in the 
direct application. 

So finally, it seems that neither of two approaches to teaching 
applied statistics leads to the ideal product: a student conversant 
enough with statistics to apply techniques imaginatively in novel 
or unusual situations. The able student trained by Stilson’s method 
might be able to communicate his needs to a statistician, but with- 
out a statistician’s help he is likely to be no better off in the end 
than the traditionally trained, techniques-oriented, highly limited 
student. 

A dark picture? Very. But not one without obvious if somewhat 
painful solutions. The first step is to stop regarding the behavioral 
and social sciences as semi-humanities and therefore to stop select- 
ing humanities students for graduate (social or behavioral) science 
programs. (Perhaps the more basic question is whether a humani- 
ties program that does not contain a heavy dose of mathematics 
and so neglects 5000 years of human thought is truly humanistic.) 
Beyond this, more complete and rigorous undergraduate grounding 
in mathematics is a necessity. Third, an undergraduate course 
in statistical theory should be required of social science majors. 
Lastly, graduate training in statistics should begin with a book like 
Stilson’s, should progress to an intermediate course in statistical 
theory, and should wind up with a course providing an integrated 
approach to applied statistics, e.g., Stilson, volume 2. 

Stilson has a good book and a better idea. The book cannot be 
useful in present graduate psychology programs because the aver- 
age student is not prepared for it. But the idea the book represents 
is one whose implementation is fast becoming a necessity if the 


JAMES A. WALSH 
Iowa State University 


Statistical Methods (2nd ed.) by Allen L. Edwards. New York: 
Holt, Rinehart and Winston, Inc., 1967. Pp. [illegible] + 462. 


A few years ago when the reviewer was a graduate student, 
many different psychological statistics books, each having its own 
peculiar notion, were in use. It always seemed to him, however, 
that the students who learned from Allen L. Edwards’ “green 
book" were a bit more sophisticated than the rest of the students. 
Since then, that impression has been confirmed on numerous oc- 
casions, and the reviewer has helped his students to weather their 
statistics requirements by using Edwards’ “red book," “green 
book,” or “blue book,” depending on their background and the 
level of the course. Therefore, he was delighted to receive a copy 
of Edwards’ Statistical Methods (still the “green book"), a revision 
of his 1954 Statistical Methods for the Behavioral Sciences. The 
new book, like its predecessor, is appropriate for upper under- 
graduate and beginning graduate students having some knowledge 
of algebra. However, Appendix A of the book gives a review of 
elementary mathematics for those who need it. 

The new book is shorter than the old one by approximately 80 
pages, and the notation has been changed somewhat to make it 
consistent with the notation used in Edwards’ Experimental Design 
(the “blue book”). Chapters 1 through 9 of the new book are 
revisions of Chapters 3 through 11 of the old book, with the 
exception of Chapter 9 (old book) on random errors of measure- 
ment, which has been omitted. These chapters cover the topics of 
central tendency, variability, frequency distributions, linear regres- 
sion and correlation, other measures of association, and the 
binomial distribution. Three new chapters have been added: Chap- 
ter 8—Sets, Samples, and Random Variables; Chapter 11—One- 
Sided and Two-Sided Tests and the Power of Tests; Chapter 15— 
Experiments Concerned with Change in Performance Over Trials. 
The material in Chapter 12 (old book) on the normal distribution 
has been incorporated into Chapter 9 of the new edition, and 
Chapters 13 and 14 (old book) on t tests for independent samples 
and paired observations are covered in Chapter 10 of the new book. 
The material in Chapters 15, 16, and 17 of the old book, which 
dealt with the significance of correlation and regression coefficients 
and an introduction to analysis of variance, is in Chapters 11, 12, 
and 13 of the new book. Finally, the topics of chi-square and rank 
order statistics, discussed in Chapters 18 and 19 of the old book, 
are presented in Chapters 16 and 17 of the new. The new book has 
a glossary of symbols, but the list of statistical formulas has been 
omitted, a perhaps unfortunate omission from the standpoint of 
the mathemaphobic student. Also, it would have been helpful to 
many students if a glossary of terms had been included. 

Needless to say, the book is well-written. There is more material 
here than in the earlier edition on probability theory, analysis of 
variance (multiple comparisons, trend tests, power of tests), and 
a variety of non-parametric tests is also discussed. Also, the extensive 
sets of problems (examples) at the ends of the chapters, the 
answers to which are given in Appendix C, should provide adequate 
practice. 


LEWIS R. AIKEN, JR. 
Guilford College 


An Introduction to Psychological Statistics by Philip H. DuBois. 
New York: Harper and Row, 1965. Pp. 530. 


Because of the plethora of new books on introductory statistics, 
an instructor, in choosing a text, is forced to rely increasingly on 
published reviews rather than on the shared experiences of colleagues. 
Perhaps the reviews would be more valuable if they represented 
evaluations based on classroom use, as well as on an (often 
cursory) examination. Although DuBois’ text has been reviewed 
earlier by Peter A. Taylor in EDUCATIONAL AND PSYCHOLOGICAL 
MEASUREMENT (Winter, 1965), these reviewers feel that many 
points were left unexplored. This feeling was engendered by using 
DuBois as a text with advanced undergraduate-graduate classes in 
educational statistics and by having student reactions to draw upon 
in the evaluation. The students, in general, found DuBois quite 
difficult. It must be admitted, however, that the mathematical (not 
to say the arithmetical) preparation of these students varied widely 
with a highly positive skew. 

This text is not only a comprehensive introduction to psychological 
and educational statistics, but also a succinct introduction 
to matrix algebra, test theory, and factor analysis. As might have 
been expected from the author’s previous publications, correla- 
tional techniques are treated extensively. The primary emphasis is 
on descriptive statistics even though whole chapters are devoted 
to chi square, analysis of variance, and nonparametric hypothesis 
testing procedures. A somewhat casual survey is given the t-tests 
and analysis of variance. 

The first five chapters present statistical concepts appropriate to 
the scales of measurement associated with S. S. Stevens. And, although 
one may quarrel with the inclusion of grouped data 
methods, this set of chapters is well written and was well received 
by the students. 

Three specific weaknesses in this first section deserve mention. 
DuBois, under χ² techniques, discusses the adjusting of observed 
frequencies when cells have expected frequencies below five, but 
then he computes a Yates correction for continuity on an example in 
which the observed (not the expected) frequencies fall below five. 
In a computational example of Spearman’s rho, DuBois gives the 
formula as ρ = 1 − 6Σd²/[N(N² − 1)], but then shows the numerical 
[denominator as 15] × 14 × 16, which employs the equality 
N(N² − 1) = N(N − 1)(N + 1). Except for the [mathemati]cally 
trained student, this seemed to create a great deal of frustration. 
A little more elaboration on the equivalency of the z and raw 
score distributions would have made it much easier for the students 
to accept the use of z's in the examples and formulations found in 
later chapters. 
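
The identity that troubled the students, N(N² − 1) = N(N − 1)(N + 1), is easy to confirm by direct computation. In the sketch below N = 15 is inferred from the factors 14 and 16 quoted in the review, and the two sets of ranks are hypothetical, not DuBois' data; the formula is the standard one for untied ranks.

```python
def spearman_rho(ranks_x, ranks_y):
    """Spearman's rho for untied ranks: 1 - 6*sum(d^2) / (N(N^2 - 1))."""
    n = len(ranks_x)
    sum_d2 = sum((x - y) ** 2 for x, y in zip(ranks_x, ranks_y))
    return 1 - 6 * sum_d2 / (n * (n ** 2 - 1))


N = 15  # inferred from the "x 14 x 16" in the review

# The two forms of the denominator are the same number.
denominator_formula = N * (N ** 2 - 1)
denominator_factored = N * (N - 1) * (N + 1)
assert denominator_formula == denominator_factored == 3360

# Hypothetical ranks: adjacent pairs swapped, last rank unchanged.
ranks_x = list(range(1, N + 1))
ranks_y = [2, 1, 4, 3, 6, 5, 8, 7, 10, 9, 12, 11, 14, 13, 15]

rho = spearman_rho(ranks_x, ranks_y)
print(rho)
```

The factored form only restates N² − 1 as a difference of squares, a step the text evidently left unexplained.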

In the next section, chapters 6-9, linear correlation and regres- 
sion are discussed—including product-moment correlation, multi- 
ple correlation, part and partial correlation, and special correlations 
(rho, tau, phi, biserial, point biserial, and tetrachoric). The exposi- 
tion of multiple correlation, using successive residualization, was 
particularly effective. The order of treatment of these topics also 
was very effective and, except for the preponderance of coding 
techniques in the correlation chapter, the examples and exercises 
are well coordinated to the subject matter. Three caveats on this 
section might be given: (1) the existence of two regression lines 
in analyzing scatter plots is not emphasized; (2) students experi- 
enced difficulty with the notation and with the (perhaps) excessive 
reliance on z-score formulations as opposed to raw scores; and 
(3) statistical concepts are employed before they have actually 
been introduced (e.g., part r in example 8.4). 

Chapters 11 through 14 lay the foundations for advanced work 
in experimental design. Probability theory is glossed over while, 
in the same chapter, skewness and kurtosis are allotted seven pages. 
In an outstanding chapter, “Families of Chance Distributions,” the 
χ², t, and F distributions are described in terms of empirical
distributions of samples of random normal deviates. Chapter 12
discusses such topics as confidence intervals, standard errors, types of
errors, and t-tests. No technique is given for testing for homo- 
geneity of variance nor is there a discussion of what t-test to 
apply and how in situations of heterogeneous variance. Within the 
chapter on analysis of variance nearly two pages are spent ex- 
pounding Bartlett’s Test of Homogeneity of Variance, but no 
warning is made of that test’s extreme sensitivity to non-normality. 

The last four chapters provide an excellent survey of test statis- 
tics, matrix theory, factor analysis, and non-parametric statistics. 
However, the definition of a scalar could have been greatly
simplified by calling it a real number and it would have been
desirable to discuss the importance of sample size, with respect to
the number of variables, as a criterion of adequate factorization. 
It is doubtful whether an instructor could cover these topics in a 
one semester course. 

Finally, these reviewers found that the Instructor's Manual
provides a pool of surprisingly good (in terms of item and test
statistics) examination items. This same manual also contains additional
exercises for each chapter. Unfortunately, the manual gives a num- 
ber of incorrect answers to the end of chapter exercises. 

Although users of this text may wish to vary the sequence of
the chapters (e.g., postpone a discussion of multiple and partial
correlation), DuBois is clearly a superior text for use with
advanced undergraduate or beginning graduate students. This book
should find wide, and deserved, acceptance. 


LAWRENCE M. ALEAMONI AND WILLIAM M. STALLINGS
University of Illinois 


Elementary Statistical Theory for Behavioral Scientists by John 
W. Cotton. Reading, Mass.: Addison-Wesley, 1967. Pp. x + 86, 
paperback. 


As its title and length indicate, this book is a brief discussion of 
statistical theory, designed “as a supplement to texts in elementary 
or intermediate statistics courses." The book explores, or at least 
introduces, a wide range of topics, including distribution func- 
tions, t- and z-tests, correlation, Chebychev’s inequality, Bayes’ 
theorem, minimum variance unbiased estimators, the Central Limit 
theorem, randomization tests, robustness, power of tests, credible 
(Bayesian) intervals, and likelihood ratios. Numerous tables, 
graphs, and assigned exercises help clarify the theory and concepts, 
but the author moves rapidly, and a careless reader will become 
lost very quickly. It should also be noted that using a supplemen- 
tary text such as this one frequently poses problems of differences 
in symbolism and style from the primary text. In addition, the 
majority of topics introduced in this book are discussed more 
thoroughly in Hays’ Statistics for Psychologists and other books 
which deal with the entire elementary or intermediate psycho- 
logical statistics course. 

Actually, Professor Cotton's monograph is written at a fairly
high conceptual level for undergraduate students who have gone 
only through college algebra. True, the basic ideas underlying the 
calculus are referred to, but it has been the reviewer’s experience 
that this is not sufficient to justify the assumption that integrals 
and other calculus symbols will thenceforth be understood. There- 
fore, the instructor who adopts the book will probably discover 
that he needs to spend more time introducing his students to the 
calculus than Professor Cotton has. 

In spite of this shortcoming, the book will be useful to
instructors who wish to spend more time on theory in their
undergraduate or graduate statistics courses. It will also serve as a good
brief review guide for graduate students and professors to study.


LEWIS R. AIKEN, JR.
Guilford College 


Elementary Statistical Procedures by Clinton I. Chase. New York:
McGraw-Hill, 1967. Pp. vii + 245.


The academic year 1966-67 was a good one for psychology 
instructors who like to receive free elementary statistics books.
Among the many books received by the reviewer, frequently in 
duplicate, were Runyon and Haber, Armore, Courts, Huntsberger, 
Games and Klare, and a new edition of the green Edwards. Added 
to the books by Adkins, Gourevitch, Peatman, DuBois, and several 
others received during the preceding year, the reviewer now has a 
good start on an elementary statistics library. If we add to this the 
introductory psychology textbooks, there are at least enough books 
to fill a large bookshelf. 

The matter of whether all of these books are necessary is
relative, since apparently the publishing companies and authors
thought they were needed. But the existence of so many elementary
statistics books does seem to represent a waste of psychological
and mathematical talent, and very few instructors will have the
time or inclination to examine all of these books in detail. 

With the publication of Elementary Statistical Procedures by 
Clinton I. Chase, McGraw-Hill now has “low level” and “high 
level” texts in introductory psychology (Sartain et al. vs. Morgan 
and King), abnormal psychology (Strange vs. Kisker), and psy- 
chological statistics (Chase vs. Guilford). In spite of the fact that 
it should probably not have been written at all, Chase's book is 
fairly well written. It is basically a low level, how-to-do-it cook- 
book in elementary psychological statistics, and as such it should 
be more popular than a theoretically-oriented book—especially 
with students having a large positive difference between their ver- 
bal and quantitative scores. 

Some noteworthy features of the book are the use of blue and 
black type (The blue type appears to be another McGraw-Hill 
“first.”), the presentation of panels of problems immediately fol- 
lowing discussion of the relevant material, and panels of computing 
procedures, which make for quick reference and review. The
eleven chapters, plus two appendices on algebra and statistical 
derivations, cover, very briefly, the traditional statistical topics in 
the traditional order, ranging from frequency distributions through 
elementary analysis of variance and non-parametrics. There is some 
unusual terminology, e.g. decoids and percentoids, and “smoothed 
frequency polygons” are not used enough for the attention given 
them in Chapter 2. There are also the usual spelling and
typographical errors (e.g., p. 76, footnote: “Baysian”; p. 49: last symbol
on lines 2 and 4 should be reversed; p. 187, l. 4: “pulses”).
The majority of the examples given are from psychological testing, 
and this fact, together with the superiority of the chapter on
correlation and regression to the other chapters, betrays the author's
area of special competence. 

To recapitulate, the low Flesch index and the attractive, cook- 
book format of this book will appeal to many non-mathematically 
inclined students, but there are other texts which probably do the 
job just as well without scrimping so much on theory. 


LEWIS R. AIKEN, JR.
Guilford College 


Elementary Statistical Methods for Educational Measurement (3rd
ed.) by Albert E. Bartz. Minneapolis, Minnesota: Burgess
Publishing Company, 1966. Pp. v + 109 (paperback).


This is not a text. While written in textual style, it is intended 
to be used as a workbook in conjunction with whatever else an 
instructor might use in teaching the subject matter of measure- 
ment. Thus, the author has escaped the almost certain wrath of a 
hardcore book review. 

Even as a workbook, the publication appears to be inadequate if 
the user is contemplating minimal coverage. The body of the 
workbook is covered in sixty-five pages under seven chapter head- 
ings titled: Frequency Distributions, Measures of Central Ten- 
dency, Percentiles and Norms, Measures of Variability, Correlation, 
Evaluation and Interpretation of Tests, and Summary and
Conclusions. While the chapter titles imply a respectable introduction
to statistical methods, the content is highly superficial. For ex- 
ample, the “chapter” on percentiles and norms is covered in three 
and a half pages. No mention is made concerning other methods 
of reporting scores. Validity is covered in two pages with no 
attempt to discuss the different types. In addition, it would seem
that something like the concept of the item difficulty index might 
fit into the domain of elementary statistical methods for educational 
measurement. However, nothing is introduced in the area of item
analysis throughout the entire workbook. It is no wonder that
the final chapter, Summary and Conclusions, is less than a page and

As a means of evading the lack of coverage and depth, the 
author repeatedly indicates the introductory nature of the book
and refers the student to “. . . any current text on tests and
measurements.” It is the opinion of this reviewer that any current 


text on tests and measurement would serve better than this one.


JAMES C. MOORE
The University of New Mexico 


Basic Mathematical and Statistical Tables for Psychology and
Education by William M. Meredith. New York: McGraw-Hill
Book Co., 1967. Pp. ix + 333.


Readers of books in the statistical series, psychology and educa- 
tion series, and in the mathematics series of the McGraw-Hill 
Book Company will recognize many of the tables that appear in 
this volume. Although there will be those research workers in 
psychology and education who cannot help making a remark to 
the effect that the compilation of tables with a few introductory 
remarks here and there is an easy way to write a book and to gain 
a quick credit for promotion up the academic ladder, both students 
and professors in the behavioral sciences can be grateful to Profes- 
sor Meredith for having brought together in one place nearly 70 
tables that are immensely useful in the quantitative treatment of
experimental and test data. One will readily recognize tables
borrowed from such familiar sources as Dixon and Massey's
Introduction to Statistical Analysis, Guilford’s Fundamental Statistics in
Psychology and Education, Siegel’s Nonparametric Statistics in the 
Behavioral Sciences, and Winer’s Statistical Principles in Experi- 
mental Design. The selection is complete and attractive for the 
research worker in both psychology and education. Thus this 
volume may be expected to find its way very soon to the shelves 
of most graduate students, professors, and research personnel in 
both civilian and military agencies. 

There are nine principal sections of which the first three will be 
familiar to students with at least one or two years of college 
mathematics. Subsequent to the presentation of miscellaneous 
numerical tables involving trigonometric functions, assorted log- 
arithmic functions, gamma and factorial functions, and binomial 
coefficients and factorials in Section 4, nine different tables on 
common probability distributions are included in Section 5. 
Twelve familiar tables corresponding to classical statistical tests 
are included in Section 6, and nine well-known tables for common 
nonparametric and so-called distribution-free tests are
incorporated within Section 7. For the area of correlation and association
in Section 8, ten tables have been presented. The closing section 
consists of miscellaneous statistical tables that vary from simplified
statistics and statistics based on sample ranges to tables of random
numbers, random normal numbers, coefficients of orthogonal
polynomials, arcsin transformations, entropy of a discrete binary source,
and a list of references from which several of the tables have been
taken.

In addition to tables with numerical entries, several nomographs
and abacs are incorporated at strategic points. The readability and
format of the tabular and graphic material are highly pleasing, 
legible, and easy to follow. The reviewer is happy to recommend
this volume as an important addition to the library of the student 
and scientist alike. It should enjoy both a wide distribution and 
longevity by any reasonable norm in the publishing business. 


WILLIAM B. MICHAEL
University of Southern California 


Review of Educational Research: "Methodology of Educational 
Research" by Leslie D. McLean (Editor). Washington, D. C.:
American Educational Research Association. Volume XXXVI, 
No. 5, 1966. Pp. 487-623. $3.00. 


The editors of Review of Educational Research have again made 
a significant contribution to the dissemination and the evaluation 
of information relative to educational research methodology. The 
planning of the concerned review is particularly to be commended.
Educational research has been approached with a very broad view, 
and the issue is not confined in coverage to merely statistics, 
mathematics, and design, although these and related subjects have 
received ample attention. Coverage of other important educational
research areas, such as methods of observing and recording be- 
havior, has contributed to the review's value to researchers, grad- 
uate students, school administrators, and university teachers. 

The clarity of the writing and the quality of the editing are 
exemplary. Particularly noteworthy is the fact that the authors of
the more technical chapters have managed to achieve a high level 
of readability where space limitations have precluded lengthy ex- 
planations. That there is little resort to complex symbolic mathe- 
matical formulations will be gratifying to those readers who have 
limited mathematical training and the more quantitatively sophis- 
ticated readers who wish to peruse the journal rapidly. 

The soft bound book includes reviews of the statistical literature 
for the three-year period beginning in December, 1963. Literature 
in the other areas is essentially reviewed for the six-year period 
beginning in December, 1960. Where appropriate, however, au- 
thors have referred to significant articles published earlier. 

The titles of the eight chapters and their authors are as follows: 
I. “Design and Analysis Methodology: An Overview” by Leslie
D. McLean, II. “Bayesian Statistics” by Donald L. Meyer, III.
“Nonparametric Statistics” by Ellis B. Page and Donald R.
Marcotte, IV. “The Observation and Recording of Behavior” by
Robert D. Boyd and M. Vere Devault, V. “Sample and Survey
Designs in Education: Focus on Administrative Utilization” by
Orlando F. Furno, VI. “Factor Analytic Methodology” by Gene
V. Glass and Peter A. Taylor, VII. “Computer Assistance with
the Educational Process” by Duncan N. Hansen, VIII.
“Multivariate Analysis” by Elliot M. Cramer and R. Darrell Bock.

The quantitatively oriented reader will find much of value in the
chapters on design methodology, Bayesian statistics, nonparametric
statistics, factor analytic methodology, and multivariate analysis.
The evaluative nature of the factor analytic chapter makes it
particularly valuable for a student in the concerned field. The
dedication of a chapter to the burgeoning field of multivariate
analysis represents a long-needed recognition of the importance
of this area.

The chapter on computer-assisted instruction is highly informa- 
tive and provides reference to many important documents of which 
the typical reader may not be aware. The authors of the chapter 
on the observation and recording of behavior are to be commended,
not only for excellent description of the pertinent literature but 
also for developing a classificatory system to describe the category 
systems which have been developed to record and describe be- 
havior. 

The chapter on sample survey designs is essentially a “how to do 
it” guide for school administrators. Although the chapter achieves 
its objectives well, and although there was probably need for such 
a guide, there is a question of the appropriateness of including a 
chapter such as this in a review of educational research 
methodology. 

In view of the large number of published and unpublished books 
and articles cited, the review can be valuable as a source of refer- 
ence material alone. 

Despite minor shortcomings, Methodology of Educational
Research should be a welcome addition to the library of anyone
seriously interested in educational research and related topics. 


MARY L. TENOPYR
North American Aviation 
El Segundo, California 


Assessing Behavior: Readings in Educational and Psychological 
Measurement, by John T. Flynn and Herbert Garber. Reading, 
Mass.: Addison-Wesley Publishing Co., 1967. Pp. vi + 377 
(paper), $3.95. 


One of the seeming sine qua non’s of pedagogy at the university 
or college level is to coerce students into reading primary-source 
material—collateral, outside reading—of some kind. With the 
rapid expansion of university and college building, the inevitable
and concomitant difficulty for libraries to have all copies of all 
journals or key books has given some impetus to the concept of 
academic anthologies. Availability of literary material is
essential to efficient scholarship, and even if books of readings are
never all things to all men, they certainly make accessibility of 
important writings less of a problem. 

That access has of late become a problem in the measurement 
area is attested to by the number of anthologies that have recently 
appeared. The present volume, however, is different in that it has 
not called upon the usual and obvious kinds of writing that have 
been condensed and reinterpreted many times. Here is an anthol- 
ogy that may be genuinely considered additional, or outside read- 
ing, and that could add both depth and scope (though primarily 
the latter) to a measurement course. 

Naturally, one does not react with equal enthusiasm to all the 
articles that have been used in the book, but one could, with fair 
honesty, say that there was nothing included that one would
object to as being wholly irrelevant. Its orientation is
non-quantitative; it has striven to appease both the psychometric and the
clinical camps; and it has obviously tried to impart knowledge 
that will enhance the interpretation of test data for what are 
perhaps unkindly called the “low-level” or irregular, test users. 

The reviewer has, in fact, only two major concerns. One con- 
cern is with the claim that the book is designed for senior, gradu- 
ate, and advanced graduate courses. Many potential users may be 
threatened by this avowed purpose, when in reality the material 
could probably be used at a much earlier level, and anyway should 
be fairly common knowledge for advanced graduate students in 
measurement. The second concern is the avoidance of quantitative 
papers. One is left with a distinct impression that the articles and 
quotes that have been selected for the printed volume have been 
done so with an eye to saleability rather than with an eye to reflect- 
ing the current status of testing. Just the look of the book makes 
it appear like a text of ten to fifteen years ago, and the sheer fact 
that it has been correlated for appropriateness to all the major 
testing texts (Anastasi, Cronbach, Ebel, Gronlund, Stanley, Thorn- 
dike and Hagen) at once belies its claim to sophistication and 
establishes a certain conservatism. Insofar as psychometrics is a
quantitative “science,” then should it be so represented. If
instructors feel they cannot make use of quantitative material, there
is no compunction to assign every article. But sooner or later,
publishers particularly, and authors second, are going to have to 
acknowledge the existence of quantification in measurement and 
the need for numerical skills to keep abreast of developments in 
the field. 

Very broadly, the book clearly has been conceived as contrib- 
uting to a knowledge of the history, problems, and strategies of 
measurement. The material is composed of a collection, in ten
chapters, of articles on the following topics: The history of
measurement; essentials of psychological tests (reliability, validity,
norms); mental measurement; personality measurement; achieve- 
ment tests; classroom evaluation; and measurement in research. It 
would be tedious, and profitless, to list the titles of all 38 articles 
in the text, but a few representative selections should illustrate the 
direction of the book. 

First, in regard to the purposes of testing, there are four articles, 
three from 1960, and one from 1958. An article from Scientific 
American by Herbert Dingle, establishes the basic problems of 
measurement, and the reader is regaled by the ETS viewpoint on
testing with articles by Dyer and Chauncey. A lay description of 
the relationship of measurement to the educational process, by 
Dorothy Adkins, rounds out the section. 

The next three chapters are far more invigorating. Reliability
has been over-viewed through the eyes of less well-known, but
nonetheless perceptive observers, in addition to classical
contributors. With the exception of the quote from the Standards, these
articles are fresh, pithy, and controversial. Wesman’s argument that 
test reliability is more crucial than scorer reliability is an interest- 
ing revival of his 1952 Test Service Bulletin. And there are other 
equally unconvincing displays of logic. There is an article on the 
culture-fair testing movement; one on the correction for guessing; 
one on norms. The use by the authors of Test Service Bulletin
material might make any self-respecting Scotsman shudder (after 
all, the Psychological Corporation does graciously distribute Bul- 
letins free), but the general effect and happy juxtaposition of 
articles has been remarkably well achieved. 

Chapters 5 to 7 are devoted to articles about specific kinds 
of tests (cognitive, non-cognitive) and to decisions based on test
data of these kinds. Brim’s article from the 1965 American
Psychologist on attitudes toward intelligence testing and Loretan’s
suggested alternatives set a speculative stage for the more sombre
performances of such persons as Ruff and Levy, Thorndike, and
Cronbach and Gleser. The choices of this section are again a
relief insofar as they do not follow oft-trodden paths. They do
tend to supplement rather than repeat. The key word is “tend”:
Why reprint a slice of Cronbach’s Essentials which surely is freely
available? Or even Cronbach and Gleser’s Psychological Tests and
Personnel Decisions, especially when this latter excerpt does little
but advertise the book? But Thorndike’s two articles, Ruff's, and

arling's have been well chosen. 

Chapters 8 and 9 are the stuff of which low level texts are 
made. The choices are too vague, too all-embracing to be of much
use to the psychometrician (or even clinician). There is no
denying that the articles are fun to read and to debate, but who, at
advanced graduate level, needs to know such gems as: “larger and
better (sic) school systems have made effective use" of test spe- 
cialists "for years”; or that “there are a number of ways that test 
scores are misused or overused”? These kinds of generalizations 
again confirm the suspicion that this is not such an advanced set 
of readings, after all. The lapse into a concern for teachers and 
classroom testing may be excusable, but surely some more pro- 
found thoughts have been formulated. Among authors whose se- 
lections appear in this section are Ebel, Diederich, Stanley, Womer 
and Lennon, each and every one of whom has had better things to 
say. 

The final section, on measurement in research, recovers the
impetus of the earlier sections somewhat. There is a very useful
article by Brownell on evaluating learning under different systems 
of instruction, an article by Allport on traits, by Comrey on the 
logic of measurement, and so on. All very useful and mature
contributions.

Apart from references at the end of some articles, there are no 
guidelines for further (yet further) reading. The authors have 
written perhaps 50 to 100 words to introduce each article: the 
introductions are often more irritating than illuminating, and 
rarely worth bothering with. 

How to sum up? One can really do little other than reiterate 
that somewhere along the line of educational progress, measure- 
ment students will undoubtedly profit from an acquaintance with 
many—if not most—of the articles adduced. When, and to what 
extent, will depend on the individual instructor. But a prospective 
user could probably make good use of much of this material at a
relatively early stage in the educative process, and return to it at 
more advanced stages as fuel for debate. Perhaps Johnson
forestalled the reader when he said, “read over your compositions, and
wherever you meet with a passage which you think is particularly
fine, strike it out.”


PETER A. TAYLOR
Rutgers University 


Measuring Pupil Achievement and Aptitude by C. M. Lindvall.
New York: Harcourt, Brace and World, Inc., 1967. Pp. xi +
188. $2.25 (paperback).


In nine succinct chapters the author covers a wide range of
topics in educational measurement and evaluation. After discussing
the role of tests in education in the first chapter, he proceeds to
discuss in detail the planning of the evaluation process in the
second chapter by putting considerable emphasis upon the need for,
derivation of, and procedures for preparing specific behavioral ob- 
jectives within the framework of a taxonomy and matrix. This 
chapter, along with the third one on principles of achievement 
test construction, constitute two of the strongest sections of the
volume. In chapter four, consideration is given to the use of tests 
to measure various levels of ability such as knowledge, compre- 
hension, application, analysis, synthesis, and evaluation. The illus- 
trative test items are particularly helpful to the teacher and be- 
ginner in the construction of classroom tests. 

The next two chapters—five and six—are devoted respectively 
to using statistics to derive test scores and to employing statistics 
to appraise tests. Although most of the familiar topics are covered 
in relation to norms, reliability, and validity there is serious con- 
cern on the part of the reviewers regarding the extent to which 
teachers may be able to profit from such a brief presentation, as 
illustrative material and accompanying consideration of the
fundamentals of measurement are quite limited. Similarly, the
consideration given to standardized achievement tests in chapter seven and
to tests of scholastic aptitude in chapter eight is quite abbreviated. 
It might have been helpful had the author presented a listing of 
the major standardized tests in tabular form with some brief de- 
scription of the intended group of examinees, the basic purposes 
of the instruments, and source material concerning school situa- 
tions in which the measures might be advantageously employed. 
Although the last chapter sets forth a number of mechanical pro- 
cedures that may be employed in evaluation of school programs, 
little attention is given to relating these mechanical modes to basic 
problems in the curriculum or to the learning and development 
process. In other words, there appears to be a marked lack of 
relating what are essentially measurement procedures to the rela- 
tively more subjective goals and philosophy underlying the edu- 
cational process. The reviewers thought that a golden opportunity 
existed in the concluding chapter to show relationships to the 
message so carefully set forth in the second chapter which was
concerned with the planning of the evaluation process. Their
expectations, however, were not fulfilled.

Despite certain limitations in this volume, the overall presenta- 
tion compares favorably with that found in a large number of 
recently published books in the area of educational measurement 
and evaluation. One must realize that the author has had to cover 
a great deal of material within a space limitation considerably less
than that imposed upon most authors. In terms of a cost effective- 
ness model, this volume is a bargain for both the student in the 
beginning course in educational measurement and for the school 
system which may wish to inaugurate an in-service training
program for its teachers and administrators.


JOAN J. MICHAEL
San Fernando Valley State College

WILLIAM B. MICHAEL
University of Southern California


Intelligence: Perspectives 1965 by Orville G. Brim, Jr., Richard S.
Crutchfield, and Wayne H. Holtzman. New York: Harcourt, 
Brace & World, 1966. Pp. x + 101. $3.75. 


Intelligence is a highly valued commodity in the labor markets 
of the world. For that reason, among others, it is important that 
intelligence should be validly defined and accurately measured. It 
is valued both as a raw material and in a more refined state as a 
product of formal education. The accurate measurement and valid 
definition of intelligence are important for quality control in the 
educational process, to insure that the refining process does not 
result in waste, dilution, or contamination. Better measurement of 
intelligence might well result in a more reasonable scale of prices 
for the refined product and more efficiency in the production 
process.

This book presents three of the views of intelligence that have 
been receiving increased popular attention in recent years. These 
views are presented by leaders in their respective areas of interest, 
persons whose views are likely to be of continuing significance. 
However, there is not much in common between the three chap- 
ters they have presented, and perhaps the best service a reviewer 
may provide is to suggest a more comprehensive description of 
what intelligence may be, so that the reader of the book may 
evaluate each of the three facets presented in the book as they 
contribute to one’s total understanding of the subject. 

In one view, intelligence is the ability to solve problems and the 
ability to learn solutions to new problems. Wayne Holtzman 
quotes Boring's operational definition of intelligence, “the capacity 
to do well in an intelligence test.” There is surely some truth in 
this, although the reviewer has taken only a phrase out of context. 
However, it is very important to note that the tasks chosen for 
intelligence tests have very frequently been validated against 
school achievement as well as age norms. Therefore, in a very real 
sense, intelligence is the capacity to do well in school, to solve the 
kinds of problems that teachers have chosen to set for their pupils. 

Of course there are many different elements that bear on the
acceptability of an item as part of an intelligence test. These ele- 
ments include: convenience of administration, low cost, effects of
time limits, reliability particularly with respect to day to day varia-
tions, and validity as indicated by developmental trends and cor- 
relation with school success. It should be noted that schools must 
have some effect on the developmental trends that can be seen in 
Western society. Such criteria as these all strongly favor the use of 
verbal material in intelligence testing. Edward L. Thorndike’s sug- 
gestion that there may also be social and mechanical elements of 
intelligence must, in Western society, overcome some tremendous 
handicaps if it is to be persuasively shown to be valid. There may 
well be some circularity in the academic situation that tends to 
attract persons whose gifts are predominantly verbal and who then 
control the curriculum that tends to be based primarily on verbal 
activities. 

The first two chapters of this book present the views of men 
who have the vision to go beyond the limits of ordinary verbal 
intelligence. In the first chapter, Wayne Holtzman presents an 
interesting discussion of some of the history of intelligence testing
and then goes on to describe a developmental study he has em-
barked upon with a special emphasis on the relationship of intelli-
gence to cognitive style. This study may be of special interest
because it will trace development over the full range of school 
from first through twelfth grade in two cultures, the United States 
and Mexico. This is followed by Richard Crutchfield’s chapter 
describing his work on teaching creative problem solving to fifth 
and sixth grade children. A feature of particular interest in this 
work is Crutchfield’s use of programmed materials for this pur- 
pose. Both of these chapters are written in clear and interesting 
style and both of these studies should serve to improve the defini- 
tion of intelligence and to show how it may increase the effective- 
ness of education. 

The third chapter of the book is not concerned with the nature 
of intelligence, but with the self-estimates of high school students 
concerning their intelligence and with the implications that these 
self-estimates may have. It includes a number of interesting and 
plausible conjectures about the concomitants of high and low self- 
estimates and a number of highly detailed tables of numerical data 
based on a questionnaire that had been administered to a represent- 
ative sample of public school students and to purposive samples of 
students from non-public schools. The author is careful to refrain 
from making statistical significance tests for almost all of his con-
jectures, since the data were in fact not amenable to such treatment.
The reader should be warned to be equally careful in not attribut- 
ing too much significance to this study. It is unfortunate that the 
data are presented in such a way that it would be easy to make 
such a mistake. 

The author of this chapter does not appear to be an uncritical
supporter of standardized testing. It seems a little strange, then,
that he places so much confidence in the responses he obtained to
a questionnaire, particularly to responses to a single question out
of the 331. Twelve of the eighteen tables are devoted to the anal-
ysis of the responses to item 126. It seems possible that a large 
number of the 4,000 students in the lower half of this group may 
have been reluctant to give a completely candid answer to the 
question, “How would you say you compare in intelligence with 
other high school students in the United States?” The reader of 
this chapter is given no information about what steps, if any, were 
taken to check on the possible effects of such reluctance. It must 
be admitted that this sort of information is frequently omitted in 
psychological literature, but this omission nevertheless makes this 
chapter somewhat difficult to use as the basis for informed decision 
making. 

There may perhaps have been two reasons for this chapter hav- 
ing been presented as it was. This work is a pioneering effort in its 
field, and pioneers are not generally noted for their convention- 
ality. Second, the author appears to be convinced of the importance 
of the problem he is dealing with and he has done all he could to 
bring it forcibly to our attention. However this may be, it is this 
reviewer's opinion that the cause of testing and the protection of 
those tested will be best served if the reader of this chapter reads 
with a careful skepticism. 

In summary, this book presents three important points of view 
about intelligence testing. The first two chapters present some of 
the interesting new developments in cognitive testing. The third 
chapter expresses some reservations about the effects of testing on 
the self-concepts of high school students. Whether one accepts or
rejects these three points of view, it appears that the opinions of 
all three authors are likely to be influential and that they deserve 
careful critical thought. 

CHARLES T. MYERS
Educational Testing Service 
Princeton, New Jersey 


Unobtrusive Measures: Nonreactive Research in the Social Sci-
ences by [...] Donald T. Campbell, Richard D. [...]


Chapter 1, “Approximations to Knowledge,” criticizes the fact
that most social science research is based upon interviews and 
questionnaires at the expense of what the authors call multiple 
operationism, a collection of methods. Threats to the valid inter- 
pretation of a difference are divided into three groups: error that 
may be traced to those being studied; error that comes from the 
investigator; and error associated with sampling imperfections. 

Chapter 2, “Physical Traces: Erosion and Accretion” examines 
research methods involving physical traces from past behavior. 
Physical evidence is divided into two broad classes: erosion meas- 
ures, where the degree of selective wear yields the measure; and 
accretion measures, where the research evidence is some deposit 
of materials. 

Chapters 3 and 4 deal with the use of archives by social sci- 
entists. It is suggested that such a source of data may serve to 
avoid the problems of invasion of privacy by allowing the re-
searcher to obtain valuable information without identifying or
manipulating the individuals involved. 

Chapter 5, “Simple Observation,” suggests that the unobserved
observer may appropriately function in certain situations but not
in others. The main advantage of the use of this technique is that 
data are obtained first-hand. 

Chapter 6, “Contrived Observation: Hidden Hardware and Con-
trol,” discusses the use of electronic devices to acquire data.

The book admittedly deals with elementary social science re- 
search measures. The sophisticated researcher may profit from a 
reminder as to techniques he has previously studied; the beginning 
researcher may be introduced to research techniques for the first 
time. 

The real value of this book may well be as a complementary vol-
ume to a text in introductory research classes in the social sciences.
In this way students who plan to adopt a behavioral science orien- 
tation and students who plan to adopt a humanities orientation may 
see how the authors’ ideas may be employed in secondary school 
teaching. There is a great need for a book which considers empiri- 
cal approaches adaptable to both the humanities and the behavioral 
sciences. This book may be most useful in this respect.


DALE L. BRUBAKER
University of California, Santa Barbara 


Essentials of Educational Research: Methodology and Design by 
Carter V. Good. New York: Appleton-Century-Crofts, 1966. 
Pp. x + 429. 


The new adaptation and updated version of Good's book has 
evolved from his earlier text Introduction to Educational Research.
From all apparent evidence these present volumes were offshoots
from Methods of Research: Educational, Psychological, and So-
ciological, co-authored by Good and Scates, which originally was
derived from the text The Methodology of Educational Research 
by Good, Barr, and Scates. 

In reading the present volume the reviewer found that the gen- 
eral introduction and philosophical material is verbose, fuzzy and 
uninspiring. Good has attempted in the first hundred pages to 
undertake the work which Boring completed in his book A His- 
tory of Experimental Psychology. Another review of the great 
psychological scientists is superficial and evades the crucial issues 
with which research workers should deal. Those who read this 
book in hope of having informative, critical, and insightful material 
with which to improve their quality of research will be disap-
pointed. Unfortunately, this book lacks even the evidence of dis-
cussing sufficiently theory or its role in research. In fact, instead 
of presenting the essentials of educational research with rigor, 
precision, and practical application for the amorphous and highly 
speculative trends of present and future research, the entire book 
is instead a scholarly approach to the methods and designs of
research. 

Only three of the nine chapters deal primarily with the es- 
sentials of research. These three discuss the aspects of surveying, 
experimental designs, and reporting. In the fifth chapter, an at-
tempt is made to present a variety of descriptive survey studies
and techniques which range from the general descriptions and 
questionnaire inquiries to action research. This chapter provides
[...] sampling [...] as well
as treatment of interviewing, [...]

The seventh chapter on experimental design is probably the 
most beneficial part of the entire book. The development and
validation of ten experimental designs are discussed with both their 


writing their dissertations, 

The author has been too lustrous in providing references, docu-
mentations, and quotations and has shirked the fundamental es-
sence of research. Thus, as an introductory text in educational
research, the utilization of this book cannot be recommended by
the reviewer.


G. ROBERT WARD
University of California, Santa Barbara


A Cross-Section of Educational Research by Edwin Wandt (Edi-
tor). New York: David McKay Company, Inc., 1965. Pp. x + 
301. 


In this book Wandt has reprinted forty educational research 
articles that were published in various journals indexed in the 
Educational Index during the years 1960-1964. The articles that
Wandt has presented are research oriented, and each contains
a statement of the problem, collection of data, analysis of
data, and conclusions. They represent a wide variety of designs, 
subjects, methods of gathering data, and methods of analyzing 
data. 

This book is designed as a supplement for courses in methods of 
educational research. Instead of the graduate students having to go 
to the library to select the articles in the various journals, this 
book provides the student with a variety of articles that can be
helpful to him in differentiating between good research, which he
may wish to emulate, and poor research, which should be avoided.
The student should be provoked into generating new ideas and 
stimulated into exploring and researching. 

The reviewer had looked forward to finding some incisive work 
that would help the graduate student to find a deeper understand- 
ing and meaning to the science of research and to develop the art 
of presenting the results in readable form. His expectations were 
fulfilled in the introductory section of this book of readings. How- 
ever, he was disappointed to find that the major portions of the 
book contained nothing but collections of various research articles 
and no discussions or critical evaluations of the articles or sug- 
gestions for conducting good research and for avoiding common 
errors. 

The major limitations of this book are that there are no editorial 
comments, summary reviews, or guided explanations, by which 
the student could better grasp the strengths or weaknesses of the 
articles according to the major criticism that “only 10 per cent of 
published papers in educational journals are worthy of being re- 
ported in the Review,” and, further, that “by minimum acceptable 
research standards, 95 per cent of the work in the field . . . that is 
concerned with causal analysis is, by either theoretical or practical
standards, invalid or trivial.”

In the reviewer's estimation this book of readings is restricted 
because of the above limitations. Anyone can put together a col- 
lection of articles. However, it takes insight and skill to direct a 
student through the crucial issues. The reviewer believes that
Wandt started to point out some insights and skills with his in- 
troductory section, but failed to carry through with his ideas. 

This collection of readings provides a fair and reasonable presen-
tation of the contributions that educators are currently making to
education, but it gives the reader only a faint impression of how to 
improve his own research. Hence the utilization of this book will 
be restricted because of its inherent weaknesses, although admit- 
tedly the editor may not have intended to give specific critiques 
on each of the articles chosen. 


G. ROBERT WARD
University of California, Santa Barbara 


Methods for Experimental Social Innovation by George W. Fair- 
weather. New York: John Wiley and Sons, Inc., 1967. Pp. 
x + 250. $7.95. 


How value-free can and should the social scientist be? What is 
the social scientist’s responsibility, if any, to his society? Positivists 
advocate a “hard-headed” value-free role for the social scientist, 
often feeling that the social sciences are methodologically equiv- 
alent to the natural sciences; a second view is the humanitarian 
value-oriented position of those who see social science as primarily 
service oriented. George W. Fairweather advocates a third view: 
the social scientist has the responsibility to aid his society through 
the application of scientific methods to social problems so that 
change can occur in a systematic, planned, and orderly manner 
compatible with and essential to humanitarian values. (The author 
admits that his view is closer to the second than the first view 
cited above.) How can this be done? In the words of the author,
“I propose that the answer to this question is to create a new social
subsystem whose methods include innovating models as alterna- 
tive solutions to social problems, experimentally evaluating them, 
and disseminating the information to those who can make the 
appropriate changes.” (p. vi) This is what the author means by
experimental innovation. It should be added that the creation of 
new subsystems, e.g. the Job Corps, would afford direct aid to
marginal members of our society so that they might become in- 
tegral members of our society. 

Implicit in Fairweather’s thesis is the assumption that a rational, 
well-planned approach to social problems is both desirable and 
possible. For example, he criticizes the antipoverty programs, 
which have been relatively recently introduced, for a lack of prior 
experimental validation. He also feels that the results of possible 
alternative solutions to any given social problem should be dis-
seminated to legislators and other decision-makers before a par-
ticular solution is adopted. One finds an almost naive optimism 
in many of the author's prescriptions; at any rate it would be
interesting to have decision-makers, e.g. legislators, react to some
of Fairweather's ideas.

The author's forte is social psychology. He has been Chief of
Social-Clinical Psychology Research Unit at the Veterans Hos-
pital in Palo Alto since 1957. Examples from his own first hand
experience are clear and helpful to the reader; when the author
moves into other areas such as public education he is on less firm
ground. The reader must decide for himself whether or not the
author's thesis is applicable to all social systems.

In Chapter 14, the author proposes that training and experi-
mental centers be established across our nation. Specific institu-
tions to be involved are universities, industrial institutions, govern-
mental agencies, and private agencies. According to Fairweather,
these agencies would themselves be a new subsystem which would
act as a mechanism for social change. Such agencies would make
possible the solution of social problems by reforms before crises
emerged. Social scientists would be more concerned with the solv-
ing of social problems than their particular discipline’s biases,
according to the author.

In short, the controversial thesis of the author demands our
attention, especially in the light of present day attacks by large
social organizations on social problems such as poverty and the
culturally disadvantaged. Although the book deals with the field
of education mainly by inference rather than direct attention, the
implications for education are important.


DALE L. BRUBAKER
University of California, Santa Barbara 


Human Information Processing by Harold M. Schroder, Michael
J. Driver, and Siegfried Streufert. New York: Holt, Rinehart
and Winston, Inc., 1967. Pp. viii + 224.


The value of this small volume lies not so much in the attempts
of the authors to fit a model to group process but in their sugges-
tions of the application of multidimensional scaling techniques to
these processes. However, one wishes that the situations surround-
ing the application of the proposed models would not have been so
structured as they were. Only in two of the ten chapters are the
effects of internal structure (individual differences) taken into
account.

Beginning with a discussion of hierarchically structured models
for human information processing in terms of indices of integra-
tion ranging from concrete to abstract, the authors come to their
basic model which they term the U curve hypothesis. This basic
two dimensional model has as its parameters environmental com-
plexity and level of information processing. Up to a point, in-
creases in environmental complexity allow for a higher level of
information processing but as the level of complexity continues
to increase, the level of information processing asymptotes and
then declines until the complexity overwhelms the ability of the 
organism to deal with relevant information. 

The authors do well to acknowledge the work of Hebb and 
Berlyne as precursors of their theorizing, but they completely 
ignore Denenberg’s (1964) important paper which must be viewed 
as the general model from which the one presented in this volume 
stems. Denenberg, summing up a research program lasting for 
better than a decade, wrote “. . . It is reasonable to expect that 
there will be an optimal level of emotionality for efficient perform- 
ance. As one moves away from this optimal level, performance 
should drop off, thus resulting in an inverted U function” (p. 341). 
Denenberg continues by expanding this model to include the ef- 
fects of motivation upon the individual which is hinted at, but not 
included, in the theorizing presented here. This omission is the 
more surprising because the authors do discuss these effects but 
leave the reader with the impression that they are at a loss for an 
explanation. 

What is especially heartening to this reviewer is the use of 
nonparametric statistics in the analysis of the form of sociometric 
data employed here. Many of the generalized experimental de-
signs used here have been previously employed under well con-
trolled conditions only to have the implications of the results lost 
under a mass of parametric statistics with no relevance to the
problem at hand. This is not the case here, and the relative sim- 
plicity of the experiments in no way detracts from their elegance. 

The chapters dealing with the idiographic variables are a re- 


anism and dogmatism. While these attempts seem, at times, almost 
an afterthought, their inclusion in a model such as this seems to 
extend its utility. 


The book has a number of implications for the realm of clinical 


function of a position on 


the concrete-abstract continuum would appear to have greater 


functional utility. 
If the theoretical approach is not completely new, the redeeming
value of this volume is its appendices. Contained in these last 40
pages are the methodological approaches that brought forth the 
effort. The attempts in these pages to quantify information inte- 
gration and the examples of the derived scoring system are excel- 
lent and well worth the time of anyone who is attempting to deal 
meaningfully with quantified verbal material. 

In summary, this reviewer has divided feelings about the book. 
As a theoretical approach to the basic problem in the behavioral
sciences of optimal levels of stimulus input and associated per- 
formance, it adds nothing new to an already burgeoning field. 
However, as an extension of the general hypothesis and a heuristic
contribution there is much contained which could add to the 
utility of future research in this area. 


REFERENCE 
Denenberg, V. H. Critical Periods, Stimulus Input, and Emotional 
Reactivity: A Theory of Infantile Stimulation. Psychological 
Review, 1964, 71, 335-351. 
STEVEN G. GOLDSTEIN
University of Oregon Medical School 


The General Inquirer by Philip J. Stone, Dexter C. Dunphy, 
Marshall S. Smith and Daniel M. Ogilvie. Cambridge, Massa- 
chusetts: The M. I. T. Press, 1966. Pp. xx + 651. 


This volume is a description of the mechanics and reasoning 
underlying the application of computers to content analysis, one 
of the most venerable psychological and educational research tools. 
There is no question but that the book is a very good one from 
many aspects: it reads easily and, for a four-author effort, this is 
no easy accomplishment; it covers the material fully leaving the
reader without any basic questions; and its approach incorporates
both theory and methodology without one detracting from the
other. 

Content analysis, in its general form, is a procedure designed to
give some degree of nominal, ordinal or interval scale properties
to essentially qualitative data. The more extreme in tone and feel-
ing is the basic material, the greater the value assigned to this
material. Content analysis can range from counting, e.g., the fre-
quency with which one word precedes another in a paragraph, to
some of the more sophisticated scaling techniques which have as
their final steps the assignment of values to words positioning
them on a known (or equal) interval scale. The attempt to com-
puterize this process is a laudable one especially when it is realized
that the input material is a paragraph or section keypunched di-
rectly on an 80 column machine card with a minimum of editing.

The first section of the book is devoted to a history of the
problem and a long and clear exposition of the solution proposed
by the authors. At times, however, there is a feeling of being toyed
with. For example, in discussing the uses of content analysis in 
different aspects of the behavioral sciences, the authors use as an 
example a content analysis done on the Schreber autobiography 
which was first made famous by Freud. Schreber was diagnosed a 
paranoid schizophrenic with homosexual preoccupations. His auto- 
biography makes much of the sun and its position in regard to him. 
For years there has been a psychoanalytic dispute as to the sex of 
the symbol that the sun represented. The authors cite a study 
where a content analysis of the interaction of the sun with words 
that have gender was considered in an effort to attach gender to 
Schreber’s sun. The reader is now thoroughly involved in the 
mystery, waiting for the synthesis. He is to be disappointed
though, for no answer is ever provided, and the book continues. 

The discussion of the actual programming solutions to be used 
on computers with different size memory banks is especially valu-
able. The basic flow problem is the same for all systems, but the
peripheral problems which are generally ignored in discussions 
such as this are also taken into account. The basic procedures of 
tagging (adding a syntax code indicating subject, verb, or object) 
and compiling a content analysis dictionary are gone into thor- 
oughly, and concise examples are provided frequently. For those 
who desire to go into the programming in depth, a user’s manual 
for The General Inquirer can be purchased as a separate volume. 

The second portion of the book is by far the largest and con- 
tains a series of 16 research reports which utilized content analysis
in all areas of the behavioral sciences. They all follow the pro-
cedures and statistical methodology generated in the first portion
of the book and serve to underscore the utility of the procedure.
For this reviewer, the three studies in the area of political science
were especially interesting, but the reason for this is unclear. It
may have been that the procedure was intuitively appealing. More 
likely, however, is the fact that the results support some of the 
reviewer's long standing political biases. 

As a summary of methodological thinking from Lasswell
through Bales (to whom the book is dedicated) to the present
authors, the work contains a great deal to recommend it to readers
interested in this form of analysis. Certainly one of the more
pleasing features is the facility to go to most forms of statistical
[...] once the content analysis has been achieved. In short,
[...] authors appears worthwhile and there is much
within its covers to recommend it to anyone interested in the
analysis of verbal or written material.


STEVEN G. GOLDSTEIN
University of Oregon Medical School


Information by the Editors of Scientific American. San Francisco:
W. H. Freeman & Co., 1966. Pp. xii + 218. 


As is acknowledged at the very outset of this small volume, it is 
basically a reprint. The 12 chapters that make it up also made up 
the September, 1966 issue of Scientific American. The title of the 
book is somewhat of a misnomer as the content is concerned with 
the processing, storage, and retrieval of information rather than 
with information as a subject entity itself. The volume attempts to 
deal with both the philosophical and physical problems of com- 
puter technology from modern and historical viewpoints. This is 
a relatively large order for such a small book and, while it does not 
completely cover the ground, it certainly is a sound primer for a 
basic understanding of the capabilities and complexities of today’s 
high speed data processing operations. 

John McCarthy’s introduction, which bears the same title as the 
book itself, is an excellent short history of the use of computers 
and the problems in internal space that have been solved in the 
past three decades. But more important, McCarthy explores the 
social problems inherent in computer technology in terms of ques- 
tions concerned with who (or what) the computer is replacing and 
who shall have access to the data stored within the memory banks 
of such systems. While these problems are hinted at in succeeding 
chapters, they are never again fully responded to. 

The five chapters following this introduction can be viewed as a 
section dealing with computer systems and software. Here, by
software is meant not only adjunct machines but also programs 
and internal system subroutines. In fact, software includes every- 
thing but the actual operating machine itself. A knowledge of the 
technical information contained in these chapters is necessary for 
a full understanding of what is to follow. For example, Suppes’
later chapter on “The Uses of Computers in Education” contin- 
ually returns to the advantages of using a visual display matrix. A 
comprehensive explanation of this type of matrix is contained in
Sutherland’s chapter dealing with the forms of input and output 
software available. 

Two chapters in this section (those by Fano & Corbató and by
Pierce) are especially interesting to the researcher in the behavioral
sciences whose problems require a relatively short time on the
hardware portion of the system. Dealing with time-sharing and 
data transmission, the chapters are clear expositions of early prob- 
lems faced in this area and the solutions that were achieved with 
Project MAC (multiple-access computer) at M. I. T. The capa-
bilities of transmitting equipment, Pierce maintains, have now
reached a level of sophistication where transmission of data to a
large computing center is not only a possible alternative to having
a smaller computer physically present but, in many instances, is
actually the alternative of choice. 

The next section of four chapters dwells upon the use of com- 
puters in science, technology, organizations, and education. While
relatively interesting from an historical point of view, these chap- 
ters provide the reader with nothing new and, because of the 
anecdotal style, with very little that can be applied elsewhere.

The last two chapters of the book must stand alone, refusing to
be placed in sections, as probably the most interesting and chal-
lenging. Lipetz not only discusses the relatively simple concepts
of elementary storage and retrieval but also continues into more 
complex tasks such as the storage and retrieval of a full university 
library. The discussion of the problems involved in such a task 
and some of the suggested resolutions to the problem is an excel- 
lent one. While this reviewer does not necessarily agree with all of 
the proposed solutions, it was felt that the movement of the book 
from anecdotal history to current problems and proposed solutions 
was a welcome trip. 

The last chapter is also an excellent one. Minsky is dealing with 
the problem of “Artificial Intelligence” and, after a brief introduc- 
tion to the early cybernetic movement, he proceeds through a 
short discussion of Newell, Shaw, and Simon’s General Prob- 
lem Solver and into the work of his own students at M. I. T. The 
discussion of Evans’ program for handling geometric analogies is 
excellent, and the reproduction of portions of the program at 
crucial points in this discussion adds a great deal to the understand- 
ing of the problem. Minsky continues into the solution of verbal 
problems and ends this part of the discussion with the description 
of a procedure whereby the computer takes as input a two-dimen- 


[...] given input model, the solution of
non-structured problems becomes possible, and it is the facility to
form these solutions which constitutes intelligence.

Thus, this book appears to b
levels, One of these levels, tha
tion, is interesting but is also
of articles, it might serve well in an academic course devoted to
computer applications and problems in the behavioral sciences. 


STEVEN G. GOLDSTEIN
University of Oregon Medical School 


Fundamentals of Guidance, by Bruce Shertzer and Shelley C. 
Stone. Boston: Houghton Mifflin Company, 1966. Pp. xviii + 
526. $7.50. 


In the vast literature exploring the meaning of guidance and the 
function of guidance workers, there has been an obvious need for a 
comprehensive work that portrayed the guidance point of view and 
the work of the counselor in the school. Where others have made 
assumptions, formulated theories, and criticized present practices, 
Shertzer and Stone have focused attention upon today’s school and 
the modern adolescent. The characteristics of adolescents, their 
major concerns, and their relationship to school and society are 
considered in the first chapter of Fundamentals of Guidance. 

Despite the fact that parents and educators have been involved 
in innumerable experiments and studies, there are many indications 
that adolescents are not understood by their parents, their teachers, 
or their counselors. Although the authors’ discussion of adolescents 
is not extensive, it does cite some of the major problems that con- 
front modern youth. In addition, Shertzer and Stone have made 
Suggestions which will enable schools to assist in combating and 
alleviating conditions that prevent adolescents from becoming 
participating and contributing members in today’s society. 

Because the scope of the book is broad, it seems appropriate to 
cite the goals or objectives of the work as outlined by the authors. 
The objectives are: 


1. To acquaint students with the fundamental subject matter of 
guidance—the individual and his development and needs. 

2. To provide students with a framework from which they may 
gain a perspective of what guidance has been, what it is now, 
and what it may become in the future. 

8. To provide students with an orientation to the services of 
guidance-their purposes, their make-up, and their strengths 
and limitations. 

4. To help students understand the problems and issues con- 
fronting present-day guidance practitioners as well as the 
tationale behind the patterns of behavior ascribed to pro- 
fessional personnel. 

- To clarify for students the trends that are emerging in the 
field of guidance, 


e 
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a smaller computer physically present but, in many instances, is 
actually the alternative of choice. 

The next section of four chapters dwells upon the use of com- 
puters in science, technology, orginizations, and education. While 
relatively interesting from an historical point of view, these chap- 
ters provide the reader with nothing new and, because of the 
anecdotal style, with very little that can be applied elsewhere. 

The last two chapters of the book must stand alone, refusing to 
be placed in sections, as probably the most interesting and chal- 
lenging. Lipetz not only discusses the relatively simple concepts 
of elementary storage and retrieval but also continues into more 
complex tasks such as the storage and retrieval of a full university 
library. The discussion of the problems involved in such a task 
and some of the suggested resolutions to the problem is an excel- 
lent one. While this reviewer does not necessarily agree with all of 
the proposed solutions, it was felt that the movement of the book 
from anecdotal history to current problems and proposed solutions 
was a welcome trip. 

The last chapter is also an excellent one. Minsky is dealing with 
the problem of “Artificial Intelligence” and, after a brief introduc- 
tion to the early cybernetic movement, he proceeds through а 
short discussion of Newell, Shaw, and Simon’s General Prob- 
lem Solver and into the work of his own students at M. I. T. The 
discussion of Evans’ program for handling geometric analogies is 
excellent, „and the reproduction of portions of the program at 
crucial points in this discussion adds a great deal to the understand- 
ing of the problem. Minsky continues into the solution of verbal 
problems and ends this part of the discussion with the description 
of a procedure whereby the computer takes as input a two-dimen- 
sional representation of a three-dimensional object and then at- 
tempts to analyze the object into its most elementary three-dimen- 
sional forms (e.g., rectangles and prisms). 
pus ее descriptions of current work, Minsky goes on to 
d з what 1s happening internally within the computer is 
eee оше stirrings of rudimentary verbal and abstract intel- 
р * P within a given input model, the solution of 
pisa problems becomes possible, and it is the facility to 
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of articles, it might serve well in an academic course devoted to 
computer applications and problems in the behavioral sciences. 


Steven G. GOLDSTEIN 
University of Oregon Medical School 


Fundamentals of Guidance, by Bruce Shertzer and Shelley C. 
Stone. Boston: Houghton Mifflin Company, 1966. Pp. xviii + 
526. $7.50. 


In the vast literature exploring the meaning of guidance and the 
function of guidance workers, there has been an obvious need for a 
comprehensive work that portrayed the guidance point of view and 
the work of the counselor in the school. Where others have made 
assumptions, formulated theories, and criticized present practices, 
Shertzer and Stone have focused attention upon today’s school and 
the modern adolescent. The characteristics of adolescents, their 
major concerns, and their relationship to school and society are 
considered in the first chapter of Fundamentals of Guidance. 

Despite the fact that parents and educators have been involved 
in innumerable experiments and studies, there are many indications 
that adolescents are not understood by their parents, their teachers, 
or their counselors. Although the authors’ discussion of adolescents 
is not extensive, it does cite some of the major problems that con- 
front modern youth. In addition, Shertzer and Stone have made 
suggestions which will enable schools to assist in combating and 
alleviating conditions that prevent adolescents from becoming 
participating and contributing members in today’s society.

Because the scope of the book is broad, it seems appropriate to
cite the goals or objectives of the work as outlined by the authors.
The objectives are:


1. To acquaint students with the fundamental subject matter of 
guidance—the individual and his development and needs. 

2. To provide students with a framework from which they may 
gain a perspective of what guidance has been, what it is now, 
and what it may become in the future.

3. To provide students with an orientation to the services of
guidance-their purposes, their make-up, and their strengths
and limitations.

4. To help students understand the problems and issues
confronting present-day guidance practitioners as well as the
rationale behind the patterns of behavior ascribed to
professional personnel.

5. To clarify for students the trends that are emerging in the
field of guidance.

6. To help students develop some appreciation of the field, its
practitioners, its aspirations, and its problems.


The objectives, as stated by the authors, give indisputable evidence 
of the multiplicity of areas considered in this textbook which is 
designed primarily for beginning students in their study of organ- 
ized guidance history, present practices, current procedures, and 
future trends. 

Because there have been many divergent views in guidance, 
Shertzer and Stone have reviewed significant aspects of its history 
and identified some of its basic principles. Since there was con- 
siderable difference in the vantage points from which the early 
leaders—Meyer Bloomfield, Frank Parsons, John Brewer, William 
Proctor, E. G. Williamson, Carl Rogers, and Arthur Jones—viewed 
the process of guidance, the authors leave the clear impression that 
the differences in the approaches of these men grew out of their re- 
spective experiences with young people in this ever-expanding 
democracy. 

One of the principal values of this volume lies not in its offering
of solutions (although there is one for virtually every preconceived
position that a guidance worker may have and hence wish to
document) but rather in its portrayal of the divergent opinions
with regard to pertinent issues that are under discussion among
educators at the local, state, and national level. Common to such
discussion among educators is the tendency to generalize and to
assume a universality regrettably not documented. Shertzer and
Stone’s thorough examination of critical issues in the guidance
process should elicit serious thought for better solutions to the
problems that confront the adolescent and the school.

Poor home-school relations are often partially responsible for
ineffective guidance programs. Fundamentals of Guidance might 
have been more valuable to its readers if the importance of parent 
participation programs and parent involvement had been described 
and emphasized. Failure to give adequate consideration to par- 
ents’ role in the total education of children did not seriously impair 
the quality of the book, but present interest in this vital area nearly 
demands that new textbooks give some direction to this phase of 
guidance. 

Throughout this volume, Shertzer and Stone have made an attempt
to provide the reader with thorough knowledge of current
philosophy, accepted practices, and reference materials that are
[...] acceptable by most authorities in the field. Indeed, the
[...] chapter of the book is devoted to the major trends in guidance
[...] educators, and concerned citizens [...]per understanding of the
problems that confront school and society through a study of this
interesting and informative book.

ALTHA WILLIAMS 

San Diego City Schools 


Counseling and Psychotherapy: An Overview edited by Dugald S. 
Arbuckle. New York: McGraw-Hill Book Company, 1967. 
Pp. 226. 


Students concerned with counseling psychology will find this 
book interesting and provocative. The major controversial issues 
that are fostered in this collection of essays are the nature of man, 
his freedom, and his status as an emotional or intellectual being. 
The various authors subscribe their viewpoints, either implicitly or
explicitly, to the three issues which make for a most enjoyable book
to read.

Arbuckle has attempted to survey the area of counseling and
psychotherapy, with major emphasis upon existentialism, and to
bring together under one cover the writings of men who are
influencing this field with their ideas of what counseling and
psychotherapy are trying to accomplish. This volume is similar to
Stefflre's book, Theories of Counseling. Arbuckle prepared the
motif with the introductory chapter and then in the final chapter
compared each author with the other on the important issues
fostered throughout the book. Arbuckle is to be commended for his
personal comments at the end of each chapter. Too often during this
past decade editors have put together books of essays without
commenting on the chapters. This refreshing approach at least brings
out the highlights of each chapter along with the limitations as
viewed by the editor.

The major part of the book consists of presentations by four
theorists who offer their viewpoints about counseling and/or
psychotherapy. The four orientations are existential psychology,
written by Van Kaam; religion and its relations to counseling by
Curran; rational-emotive psychotherapy by Ellis; and the use of self in
psychotherapy by Felder. The remaining chapters handle the
implications of certain theories, whether or not they should be used,
and their future benefit to counseling. O'Hara has presented
vocational psychology as a substitution for the counseling process;
Patterson advocates that psychotherapy should be in the schools;
Landsman discusses the personality theory as relating to counseling;
and Bergin writes about the recent empirical analysis of therapeutic
issues.

One of Freud’s tenets was that unless there is an emotional
involvement on the part of the counselee, the counseling process will
have no curative effect. Felder, Van Kaam, and Curran, to varying
degrees, support Freud's supposition. However, Patterson does not
indicate within this book whether or not he agrees with this tenet.
Arbuckle points out that Patterson does uphold the subjective de- 
velopment of counseling. Felder and Van Kaam would likely rep- 
resent the extreme position about the emotional involvement, while 
O'Hara and Ellis take the opposite viewpoint of this issue. The other 
authors are dispersed somewhere between the two extremes of this 
subjective-objective dimension. 

Another important issue that Patterson points to is the prepara- 
tion of the counselor. The recent trends have been away from the 
old philosophy that the counselor should have had teaching experi- 
ence. Arbuckle also expands on this new trend by stating that “... 
an increasing number of highly motivated young people who wish 
to become counselors in a school system are turned away because, as 
many have said very simply to me, ‘I do not want to be a teacher, 
but I would like to become a counselor.’ When one considers the 
desperate social need, which has been recognized by the Federal 
Government, for the services of young, motivated, professionally 
competent counselors, it seems at least close to idiotic to say to 
such a young person, ‘You must go through a teacher-preparation 
program, then teach for one, two, or three years before you can be 
considered for a counselor-education program.’ This is simply a 
waste of human talent.” 

This issue is crucial to the future development of counseling 
programs, and to the counselors who will be educated. By recog- 
nizing counseling as a profession separate from teaching, the coun- 
seling positions will not be used as stepping stones by which teach- 
ers move up to principalship. When school systems have stopped 
this intermediate maneuvering which has plagued the counseling 
area, they can then better educate the prospective counselors who
can deal with the students’ emotional and intellectual problems in- 
stead of existing as mere information dispensers. 

The chapters by Van Kaam, Curran, and Felder give the student
an exposure to the process of counseling and/or psychotherapy.
The authors of these chapters have, to various degrees, given of
themselves in their writing about the counseling process. Both
Felder and Van Kaam have been subjective and transparent in
their presentations, and from this reviewer's viewpoint their methods
of presentation are most beneficial for someone considering this
[...] of work. In the reviewer's opinion this style of writing should
[...] the objective presentation which is so [...] a very impressive
overview for the counselor, [...] prospective student who will be
embarking upon this area for future work. The presentations by the
different authors are insightful and beneficial. Arbuckle’s book
should be in the hands of all students in the area of counseling
and/or psychotherapy and should be, if not principally an assigned
textbook, at least a necessary reference for students.

C. ROBERT WARD

University of California, Santa Barbara 


Signal Detection Theory and Psychophysics, by David M. Green
and John A. Swets. New York: John Wiley and Sons, Inc., 
1966. Pp. + 455. 

This compact book is all meat. The authors have artfully packed 
into 416 pages a “full and systematic presentation of the field” 
plus three expository appendices and 276 references. Although the 
text is liberally sprinkled with formulae and derivations, a knowl- 
edge of calculus is not essential, and the serious student should have 
little trouble understanding the concepts and processes as verbally 
stated. Clearly written and well organized, the book is divided into 
three sections. 

Part I, Decision Processes in Detection, introduces the reader to 
statistical decision theory and describes its application to basic
psychophysical procedures. Chapter 3 discusses assumptions and
theories of signal and noise distributions. Empirical data obtained with
three of the basic psychophysical methods are presented and com- 
pared in Chapter 4. The last chapter in this section critically re- 
views the classical and current sensory threshold theories. Predic- 
tions based on these theories are evaluated by reference to obtained 
data. 

Part II, Sensory Processes in Detection, discusses several models
of Ideal Detectors and compares the performance of human
observers with the predictions of two such models. The section
describes in detail a simple energy detection model which Green
and Swets feel “. . . provides a rather good analogue to the human
auditory detection process.” Empirical data are shown to fit
well with predictions based on this particular model.

Part III, Other Applications of Detection Theory, contains
chapters on multiple observations, auditory frequency analysis, and 
speech communication. The final chapter demonstrates the poten- 
tial utility of detection theory as a heuristic. Areas covered are 
animal psychophysics, sensory physiology, reaction time, vigilance, 
time discrimination, stimulus interaction and attention, subliminal 
perception, and recognition memory. 

For the seasoned researcher or mature graduate student inter- 
ested in psychophysics or perception, this book is “required” 
reading. 
John E. DePavit
Human Factors Research, Inc. 
Santa Barbara, California 


Staff Leadership in Public Schools by Neal Gross and Robert E. 
Herriott. New York: John Wiley & Sons, Inc., 1965. Pp. 247.
$6.95.


The authors report findings derived from the National Principalship
Study, a sociological inquiry undertaken to increase knowledge
about the performance of formal leaders in schools—with special 
reference to the school principal. The report of the study is well 
stated with full descriptions of the theoretical framework which 
guided the data gathering, the instruments, and the analytical pro- 
cedures which were utilized to interpret the data. The central instru- 
ment was designed to measure “Executive Professional Leadership” 
(EPL) which is defined in the study as “the efforts of an executive 
of a professionally-staffed organization to conform to a definition of 
his role that stresses his obligation to improve the quality of staff 
performance.” 

Measurement of EPL was achieved through questionnaires completed
by from four to ten teachers in each of 175 elementary
schools. Teachers in essence were asked to describe the behavior
of their principals with respect to their efforts to conform to the
EPL definition. EPL scores were then used in the testing of a large
number of hypotheses ranging from the relationship of years of
experience and EPL to years of graduate study and EPL. The
authors have compiled in this one volume a rather thorough
examination of staff leadership in the public schools. The rigor of the
methodology goes beyond that usually found in research attempts
centered upon public education, and the reporting is forthrightly
done. The instruments developed should find profitable use in
further research. The authors include in their appendix specimen
instruments, as well as technical details of factor and scale scoring.
Of particular interest is Appendix D which presents a model for
utilizing subordinates as observers of superordinate behavior.
Instead of viewing the variance among the reports of subordinates as
an indication of unreliability, the model suggests that the variance
itself may be an indication of leader behavior, suggesting that the
leader behaves differently in relation to particular individuals and
particular settings.

The authors report as implications of their study the not too
startling suggestion that “a high level of collegiate academic
performance, a high order of interpersonal skill, the motive of service,
the willingness to commit off-duty time to work and relatively little
[...]ate courses in education, number of graduate courses in
educational administration, sex and marital status.”

The theory upon which the study is based is considerably tighter 
than in most studies related to the public schools. The authors dis- 
cuss their theoretical framework in Chapter 5; and if their work has 
an Achilles heel, it is to be found in that chapter, for while their 
theoretical framework is well laid out it is still a bit gross. Central 
to their theoretical framework is the assumption that the school 
administrator undergoes a double process of socialization. The first 
process they posit is the “preparatory phase” which occurs in in- 
stitutions of higher learning. During this phase the administrative 
candidate is assumed to internalize an ideal concept of his future 
administrative role. The second process the authors specify as the 
“phase of organizational reality"—occurring after the candidate 
has completed his university course work and upon entrance into an 
administrative position. 

The double phase of socialization is an integral part of the au- 
thors’ theoretical framework. Its inadequacy lies in the assumption 
that candidates appear full blown from the brow of Zeus in the 
university. The reviewer suggests that the double phase socializa- 
tion fails to give any consideration to the activities of an adminis- 
trative candidate during his teaching years. Those years in which 
the candidate adopts administration as a positive reference group 
—the years in which he adopts an “If I were you” stance toward his
building principal—the years in which he seeks out administrative- 
like activities while still a teacher—those very years in which he 
attempts to catch the attention of his superordinates. The teaching 
years would appear to be central to an understanding of the social- 
izing process. The authors’ “double phase” socialization takes into 
account only those events occurring within the collegiate setting 
and those after entry into an administrative role. Since the candi- 
date’s positive reference group, his “models” of behavior, exist 
in the schools, not the university, it would appear risky to suggest 
that the initial socialization occurs in the university. Merton’s the- 
ory of anticipatory socialization would appear to offer a better the- 
oretical framework, since it includes the candidate’s membership 
group (teachers) and his positive reference group (administra- 
tors) and explains the process of movement from one to the other 
with its attendant effect of alienation. The authors’ theoretical 
framework falls short of accounting for the socializing process. To 
some extent this may account for the blandness of the overall 
findings.

“Staff leadership in the public schools” is well worth acquiring.
It can be expected to generate a great deal of research and comment.
Professors of school administration, and perhaps more justly,
graduate students of school administration should find much about
which to ruminate, especially the apparent lack of relationship
between Executive Professional Leadership and the number of graduate
units in school administration.

RONALD E. BLOOD

University of California, Santa Barbara 


Excellence in Education, an Opportunity and a Challenge by Wil- 
liam B. Michael (Editor). Berkeley, California: University of 
California, 1963. Pp. ix + 125. 

During the days of March 29 through March 31 in 1963, the 
first of seven conferences took place on the Santa Barbara Campus 
of the University of California. These seven conferences had their 
thrust toward the problem areas pressed on this multicampus uni- 
versity system by reason of the population explosion in the State of 
California. 

This first conference centered on the problem of providing
“excellence” in higher education in the face of the many-faceted
demands placed on a university not only in the California
environment but also in a national environment. The title of the March
1963 conference at the University of California, Santa Barbara, was 
"Excellence in Education, an Opportunity and a Challenge." It was 
chaired by Dr. Gordon S. Watkins, Dean, School of Education. 

To open the commemoration of the 95th year of the founding of 
the University of California, a Charter Anniversary ceremony and 
banquet started the three-day symposium. Dr. James B. Conant
delivered a prophetic address entitled “Public Policy and Excellence
in Education” on Friday at the Charter Anniversary ceremony
and then delivered another address, “Fewer Slogans, More Facts,” 
at the banquet the same day. 

This heady beginning was sustained to a large degree throughout
the symposium with a stirring keynote address on the second day
by Dr. Sterling M. McMurrin, Professor of Philosophy, University
of Utah. "The Aims and Purposes of American Education" was
more than a title of his address. It was a thoughtful discourse and
logical analysis of the abstract concepts of concern for reason, for
freedom, and for the individual as the three major elements in the
American culture that affect the aims and purposes of education.

Dr. McMurrin’s probing address centered on [...]ical issues which
related to the opening thoughts placed so well before the symposium
by Dr. Conant, who called for the establishment of a national
educational policy by the various states to provide coherence
in the series of inevitable decisions to be made for the programming
and expenditure of federal funds for education. Dr. Conant
suggested an interstate educational advisory council, a “Standing
Group on Educational Strategy.” Also, he reiterated the
recommendation made by many individuals over the years that a cabinet
office of Secretary of Education be created from the Office of the
United States Commissioner of Education.

Dr. Conant, in his reprise, prodded the symposium by stating that
education in our country is "partly in trouble because we do not
know how much trouble it is in!" A consequent recommendation
was made that California gather facts for “an academic inventory.” 
He suggested as one model, the one in his book Slums and Sub- 
urbs. 

The Conference I committee undertook a stimulating program 
with an orderly progression of coherent topics. The editor did 
equally well in putting together this monograph. Between Dr. Con- 
ant’s two addresses, Dr. Clark Kerr and Santa Barbara Chancellor 
Vernon Cheadle defined the growth of the University with ampli- 
fying facts and with reference to the particular effects of this growth 
on the Santa Barbara institution. Following the keynote address 
of Dr. McMurrin, three speakers, Dr. Henry Chauncey, Dr. Lee DuBridge,
and Dr. Arthur Coons, respectively, presented their
thoughts on “The Student: Discovery of Potentialities and the 
Means of Bringing Them to Fruition"; "The Teacher: Discovery 
of Potentialities and the Means of Bringing Them to Fruition"; 
“The Environment: Creation of Conditions for the Achievement of 
Excellence in Education." 

Each of these addresses was followed by a panel discussion of 
the topic and of the content of the address. These panel discussions
are not presented in this monograph.

In fact, reflection on this monograph should enable the reader 
to realize the potential impact that conferences may have upon 
education. At this 1963 conference there is a national assessment 
of education in progress. There is being made the attempt to have 
a national policy of education. There has been thrust at the Univer- 
sity of California and at other universities concerning the problem 
of student individuality. There are being established major edu- 
cational research centers and one is in California. 

The message was certainly made clear not only to scholars in 
the disciplines of education, psychology, and statistics but also
to sociologists, anthropologists, social psychologists, and historians
to become deeply involved in the study of the educational process.

C. RUSSELL DE BURLO, JR. 
Tufts University 


The Exceptional Individual: Psychological and Educational As- 
pects by Charles W. Telford and James M. Sawrey. Englewood 
Cliffs, New Jersey, Prentice-Hall, Inc. Pp. x + 482. $7.95. 


Situated under the umbrella of “exceptionality” are the mentally 
retarded, the sensorily deprived, the muscularly limited, and the
brain damaged. Yet, there are neither mental abilities nor personal
and social adjustments common to all. Rather between these con- 
ditions there are often obvious differences in organ involvement, 
degree of limitation, and range of impaired functions. Within each, 
moreover, there is variability in ability, in adjustment, even in degree
and range of limitation. The empirical and conceptual strains
on the umbrella have led some theoreticians to sew the split seams; 
others, to replace the ribs. Telford and Sawrey invite the youth 
with speech problems, the socially deviant, and the culturally dis- 
advantaged to join the other disabled persons. To accommodate 
this increased heterogeneity, they provide a canopy which could 
envelop all humanity and not merely the specimens they discuss in 
their seventeen chapters. 

With unusual clarity, the authors identify each condition they 
discuss. They operationally define the disability and cite medico- 
legal definitions where appropriate. Following each definition are 
vivid descriptions of symptoms, enumeration of correlates, discus- 
sion of etiology, and review of home and school management tech- 
niques. So animated are the descriptions that it is easy for the 
sighted to see as the blind, and the hearing to listen as the deaf, the 
able to learn as the retarded, and the less than gifted to originate 
ideas as the creative. Few authors are able to tune auditory, kines- 
thetic, and tactile receivers to a visual transmitter of a textbook. 
Much to their credit Telford and Sawrey accomplish this. Such pro- 
found understanding is impressive. 

The authors contend that by disowning and segregating the ex- 
ceptional individual, society restricts the range of stimuli which he 
could perceive as well as the variety of coping operations which he 
could learn. These restrictions not only confine the exceptional to 
a smaller psychological space than that accorded the non-exceptional but
tell him he is unworthy and hence increase the likelihood of further 
restriction, further lowering of self-esteem, and further dedifferenti- 
ation of abilities. Hence there is a fulfillment of the prophecy 
augured by fictions traditionalized by our culture. 

Social usefulness, the authors aver, is the ultimate criterion which
[...] considered in determination of exceptionality. They state
that the exceptional individual usually refers to people who differ
from the average to such an extent that society perceives them as
needing special education, or social or vocational training. Rarely
is an opportunity missed by the authors to indict the intolerant
[...] in which we live. Of course the allegations are accompanied
by some proof. Most readers will feel guilt at their rejecting, at
some time in their life, of an exceptional individual; most will feel
empathy with him as they acknowledge their own limitations. If the
strength of the book is to be based on the social utility construct,
however, the authors’ case will require a better defense. Unlike
their exacting documentation of most principles discussed in the
book, their socio-psychological framework abounds in too many
inferences far removed from concrete data. “Society determines
which deviations will be considered disabilities or assets” is an
opening statement which leaves the reader thirsting for evidence.
However, the facts do not follow; perhaps the skeletal treatment
merely reflects the scant knowledge available in the field. To
determine society’s influence on the production of handicapped
individuals, it is necessary to answer some preliminary questions. Is
there but one society? Is there but one set of norms? Is there in 
our society high interjudge reliability on the naming of behaviors 
as deviations? Is there no differential treatment of the exceptional 
by the several members of a "society"? 

The distinction between handicap and disability is coherent with 
the authors’ general point of view. Yet, reading from that point of 
view, this reviewer became confused when the chapters about the 
sensorily and muscularly limited were headed “handicaps.” The 
book could have a greater thrust were it to have preserved the dis- 
tinctions made. 

This book will be read by practicing school psychologists, teach- 
ers, and administrators of programs for the exceptional because of 
the abundant and well documented information about genetics, 
chemistry, physiology, education, and sociology of the exceptional. 
The variety of chromosomal arrangements found in different types 
of mental retardation is noteworthy. Readers will value the enumera- 
tion of advantages and disadvantages of the sheltered workshop, the 
day hospital, and the special class, which will be used as guides by 
many a program planner. 

The recent developments in operant conditioning of the retarded 
will impress the teacher of the exceptional. This book also has 
value for the classroom teacher who daily meets and does not know 
how to cope with the epileptic, the stutterer, the culturally disad- 
vantaged, and the social deviant. The information it contains and 
the popular emphasis on social adaptability will make this book 
competitive with similar books on the market. 

Because words like “acceptance,” “social utility,” and “social 
norms” have no clear referents, readers will invest personal mean- 
ings in the terms. This will detract from the potential contribution 
to education that the book could make. In addition to facts, the 
book has advice, especially for the family. While the scientific ac- 
curacy of this advice is open to question, the suggestions are rea- 
sonable hypotheses to entertain. Without being encyclopedic, the 
book is thorough; without being overly simple, the presentation is 
vivid. Though hopeful, the authors are circumspect. Their messages 
come through loud and clear. Society has been intolerant, it is 
moving away from its nadir; the exceptional are limited but need 
not be handicapped; genetics, nutrition, chemistry, environmental 
stimulation explain certain kinds of exceptionality and its treat- 
ment; the exceptional child is to be regarded for his competence, 
not pampered; the exceptional youth is to accept his limitation, not 
succumb to it. Finally the canopy suggested serves as a springboard 
for unification of all conditions of exceptionality. This may ulti- 
mately obviate the use of the terms “nonexceptional,” “average,” or 
“normal.” Reading the book is a highly rewarding experience. 


Norman M. CHANSKY 
Temple University 

IMPORTANT NOTICE TO AUTHORS 

THE VALIDITY STUDIES SECTION is published twice a year, once in 
the Summer issue and again in the Winter issue, for which the clos- 
ing dates for receiving manuscripts are January first and July 
first, respectively. Although articles between two and eight printed 
pages are usually preferred, an occasional exception is made to pub- 
lish articles of somewhat greater length. 

Considerable flexibility exists concerning format as can be seen 
from a study of recently published articles. However, the model 
presented in the Spring, 1953, issue of EDUCATIONAL AND PSYCHO- 
LOGICAL MEASUREMENT still represents a close approximation to 
what is customarily published. The prospective contributor is en- 
couraged to read the original announcement. 

In order that the usual number of articles of other types may not 
be reduced, it is necessary to enlarge the journal and to charge the 
authors for most of the publishing costs. Because of the financial 
pressure of rising printing costs, the rate for publication of articles 
in this section will be twenty-five dollars per page. The extra cost 
of the composition of tables and formulas will be added to the basic 
rate. One hundred gratis reprints are extended to the author of 
articles appearing in this journal. 

Two copies of manuscripts should be sent to: 


Dr. William B. Michael 

Professor of Education and Psychology 
School of Education 

University of Southern California 

Los Angeles, California 90007 

A PARADIGM INVOLVING MULTIPLE CRITERION 
MEASURES FOR THE EVALUATION OF THE 
EFFECTIVENESS OF SCHOOL PROGRAMS¹ 


NEWTON S. METFESSEL 
AND 
WILLIAM B. MICHAEL 
University of Southern California 


ALTHOUGH much of what is to be presented has been said before, 
the writers believe that in the light of the recent emphasis upon 
evaluation of exemplary and innovative school programs receiving 
federal support it may be helpful to set forth a rationale to facil- 
itate their evaluation. Despite the risk of redundancy with the 
previous efforts of Tyler (1942, 1964), Bloom et al. (1956), 
Krathwohl (1964), Krathwohl et al. (1964), Lindvall et al. (1964), 
and Michael and Metfessel (1967), both the paradigm to be pre- 
sented and the list of suggested criterion measures that may be 
used in evaluation of the attainment of objectives in school pro- 
grams may be of some help to teachers, administrators, counselors, 
consultants to public schools, and certain other professional per- 
sonnel whose experience in evaluation may be somewhat limited. 

Purpose. Thus, the twofold purpose of this paper is (1) to pre- 
sent an eight-step procedural outline of the evaluation process in 
the form of a paradigm (or flow chart) that maps out in a step-by- 
step sequence the key features of the evaluation process and (2) to 
furnish a detailed listing of multiple criterion measures that may be 
used in the evaluation of specific behavioral objectives. (For max- 
imum clarity and usability these objectives are preferably stated 
as operational definitions involving measurable and observable 
changes in behaviors that have been judged to be significant and 
relevant to the broad goals and the philosophy of the educational 
institution under study.) The experience of the writers in working 
with school personnel has pointed to the need not only for a model 
which can be followed in evaluation of school programs but also for 
a listing of multiple criterion measures of potential utility in the 
instrumentation phase of the evaluation process. 

¹ Based on a paper given at the 1967 annual meeting of AERA, February 
16, 1967, held in New York City. 

The paradigm. Since the purposes and assumptions underlying 
the evaluation process as well as ways in which specific objectives 
in both the cognitive and affective domains can be stated have been 
explicitly formulated in the sources already indicated as well as 
in many textbooks in college courses in educational measurement 
and evaluation, no attention will be given at this point to these spe- 
cific concerns. Instead, emphasis will be placed on a brief state- 
ment of the key steps of the evaluation process, the details of which 
are outlined in the paradigm portrayed in Figure 1. The multiple 
criterion measures that correspond to the fourth segment of the 
evaluation process as shown in Figure 1 are included in the self- 
explanatory appendix. The eight major steps in the evaluation 
process may be enumerated as follows: 

1. Involve both directly and indirectly members of the total 
school community as participants, or facilitators, in the evaluation 
of programs—lay individuals and lay groups, professional person- 
nel of the schools and their organizations, and students and stu- 
dent-body groups. 

2. Construct a cohesive paradigm of broad goals and specific ob- 
jectives (desired behavioral changes) arranged in a hierarchical 
order from general to specific outcomes (both cognitive and non- 
cognitive) in a form, for example, that might resemble one or both 
of the taxonomies set forth by Bloom et al. (1956) and Krathwohl 
et al. (1964). Substeps involved in this second phase include (a) 
setting broad goals that embrace the philosophical, societal, and 
institutional expectations of the culture; (b) stating specific ob- 
jectives in operational terms permitting relatively objective meas- 
urement whenever possible and/or empirical determination of 
current status or of changes in behaviors associated with these ob- 
jectives; and (c) developing judgmental criteria that permit the 
definition of relevant and significant outcomes as well as the es- 
tablishment of realistic priorities in terms of societal needs, of 
pupil readiness, of opportunities for pupil-teacher feedback nec- 
essary in motivating and directing learning, and of the availability 
of staff and material resources. 

3. Translate the specific behavioral objectives into a form that is 
both communicable and applicable to facilitating learning in the 
school’s environment. 

4. Develop the instrumentation necessary for furnishing cri- 
terion measures from which inferences can be formulated concern- 
ing program effectiveness in terms of the objectives set forth. 

5. Carry out periodic observations through use of the tests, 
scales, and other indices of behavioral change that are considered 
valid with respect to the objectives sampled. 

6. Analyze data furnished by the status and change measures 
through use of appropriate statistical methods. 

7. Interpret the data in terms of certain judgmental standards 
and values concerning what are considered desirable levels of per- 
formance on the totality of collated measures—the drawing of 
conclusions which furnish information about the direction of 
growth, the progress of students, and the effectiveness of the total 
program. 

8. Formulate recommendations that furnish a basis for further 
implementation, for modifications, and for revisions in the broad 
goals and specific objectives so that improvements can be realized— 
recommendations which for their effectiveness depend upon ade- 
quate feedback of information to all individuals involved in the 
school program and upon repeated cycles of the evaluation process. 

In addition the paradigm points to the individuals having primary 
responsibility for evaluation at each of the successive stages of the 
process. These responsibilities in the form of facilitating roles are 
indicated by the insertion of the letter L for lay individuals and lay 
groups, the letter P for professional staff members and their pro- 
fessional groups, and the letter S for students and student-body 
groups. These letters are placed in the lower right-hand corner of 
each of the principal blocks of the paradigm. Each block cor- 
responds to a given phase of the evaluation process. Whenever the 
individuals designated by P, L, or S serve as primary or major 
agents (facilitators or inhibitors) at a given stage of the evaluation 
process, the letter is not placed within a parenthesis; whenever 
the individuals serve as secondary or indirect agents, a parenthesis 
is placed around the corresponding letter. The arrows from one 
block to the next indicate the approximate temporal sequence in 
which the steps occur, although in practice the order may be quite 
varied. 

A few cautions. It should also be emphasized that judgmental 
decisions are involved throughout all phases of the evaluation 
process, as the participants at each stage may be expected to make 
adjustments in their activities in terms of the amount and kinds of 
feedback received. The alert evaluator needs also to be aware that 
measures may yield indications of false gains or false losses that are 
correlated with (1) experiences in the school environment as well 
as outside the school environment that go beyond the intent of the 
specific behavioral objectives, (2) uncontrolled differences in the 
facilitating effects of teachers and other school personnel (usually 
motivational in origin), (3) inaccuracies in collecting, reading, 
analyzing, collating, and reporting of data, and (4) errors in re- 
search design and statistical methodology. In such situations, the 
wisdom and seasoned judgment of the trained evaluator are par- 
ticularly helpful and necessary if meaningful, honest, and realistic 
conclusions are to be derived from the data obtained in the evalu- 
ation process. 

APPENDIX 


MULTIPLE CRITERION MEASURES FOR EVALUATION 
OF SCHOOL PROGRAMS 


I. Indicators of Status or Change in Cognitive and Affective 
Behaviors of Students in Terms of Standardized Measures and 
Scales 

Standardized achievement and ability tests, the scores on 
which allow inferences to be made regarding the extent to 
which cognitive objectives concerned with knowledge, com- 
prehension, understanding, skills, and applications have been 
attained. 

Standardized self inventories designed to yield measures of 
adjustment, appreciations, attitudes, interests, and tempera- 
ment from which inferences can be formulated concerning 
the possession of psychological traits (such as defensiveness, 
rigidity, aggressiveness, cooperativeness, hostility, and an- 
xiety). 

Standardized rating scales and check lists for judging the 
quality of products in visual arts, crafts, shop activities, pen- 
manship, creative writing, exhibits for competitive events, 
cooking, typing, letter writing, fashion design, and other 
activities. 

Standardized tests of psychomotor skills and physical fitness. 

II. Indicators of Status or Change in Cognitive and Affective 
Behaviors of Students by Informal or Semiformal Teacher- 
made Instruments or Devices 

Incomplete sentence technique: categorization of types of 
responses, enumeration of their frequencies, or rating of their 
psychological appropriateness relative to specific criteria. 

Interviews: frequencies and measurable levels of responses to 
formal and informal questions raised in a face-to-face inter- 
rogation. 

Peer nominations: frequencies of selection or of assignment to 
leadership roles for which the sociogram technique may be 
particularly suitable. 

Questionnaires: frequencies of responses to items in an ob- 
jective format and numbers of responses to categorized dimen- 
sions developed from the content analysis of responses to 
open-ended questions. 

Self-concept perceptions: measures of current status and 
indices of congruence between real self and ideal self—often 
determined from use of the semantic differential or Q-sort 
techniques. 

Self-evaluation measures: student’s own reports on his per- 
ceived or desired level of achievement, on his perceptions of 
his personal and social adjustment, and on his future academic 
and vocational plans. 

Teacher-devised projective devices such as casting characters 
in the class play, role playing, and picture interpretation 
based on an informal scoring model that usually embodies the 
determination of frequencies of the occurrence of specific 
behaviors, or ratings of their intensity or quality. 
Teacher-made achievement tests (objective and essay), the 
scores on which allow inferences regarding the extent to 
which specific instructional objectives have been attained. 
Teacher-made rating scales and check lists for observation of 
classroom behaviors: performance levels of speech, music, 
and art; manifestation of creative endeavors, personal and 
social adjustment, physical well being. 

Teacher-modified forms (preferably with consultant aid) of 
the semantic differential scale. 

III. Indicators of Status or Change in Student Behavior Other 
than Those Measured by Tests, Inventories, and Observation 
Scales in Relation to the Task of Evaluating Objectives of 
School Programs 

Absences: full-day, half-day, part-day and other selective 
indices pertaining to frequency and duration of lack of 
attendance. 

Anecdotal records: critical incidents noted including fre- 
quencies of behaviors judged to be highly undesirable or highly 
deserving of commendation. 

Appointments: frequencies with which they are kept or 
broken. 

Articles and stories: numbers and types published in school 
newspapers, magazines, journals, or proceedings of student 
organizations. 

Assignments: numbers and types completed with some sort of 
quality rating or mark attached. 

Attendance: frequency and duration when attendance is re- 
quired or considered optional (as in club meetings, special 
events, or off-campus activities). 

Autobiographical data: behaviors reported that could be 
classified and subsequently assigned judgmental values con- 
cerning their appropriateness relative to specific objectives 
concerned with human development. 

Awards, citations, honors, and related indicators of distinctive 
or creative performance: frequency of occurrence or judg- 
ments of merit in terms of scaled values. 

Books: numbers checked out of library, numbers renewed, 
numbers reported read when reading is required or when 
voluntary. 

Case histories: critical incidents and other passages reflecting 
quantifiable categories of behavior. 

Changes in program or in teacher as requested by student: 
frequency of occurrence. 

Choices expressed or carried out: vocational, avocational, and 
educational (especially in relation to their judged appropriate- 
ness to known physical, intellectual, emotional, social, aes- 
thetic, interest, and other factors). 

Citations: commendatory in both formal and informal media 
of communication such as in the newspaper, television, 
school assembly, classroom, bulletin board, or elsewhere (see 
Awards). 

“Contacts”: frequency or duration of direct or indirect com- 
munications between persons observed and one or more sig- 
nificant others with specific reference to increase or decrease 
in frequency or to duration relative to selected time intervals. 

Disciplinary actions taken: frequency and type. 

Dropouts: numbers of students leaving school before com- 
pletion of program of studies. 

Elected positions: numbers and types held in class, student 
body, or out-of-school social groups. 


Extracurricular activities: frequency or duration of participa- 
tion in observable behaviors amenable to classification such 
as taking part in athletic events, charity drives, cultural 
activities, and numerous service-related avocational endeavors. 


Grade placement: the success or lack of success in being 
promoted or retained; number of times accelerated or skipped. 


Grade point average: including numbers of recommended 
units of course work in academic as well as in non-college 
preparatory programs. 

Grouping: frequency and/or duration of moves from one 
instructional group to another within a given class grade. 


Homework assignments: punctuality of completion, quanti- 
fiable judgments of quality such as class marks. 

Leisure activities: numbers and types of; times spent in; 
awards and prizes received in participation. 


Library card: possessed or not possessed; renewed or not 
renewed. 

Load: numbers of units or courses carried by students. 

Peer group participation: frequency and duration of activity 
in what are judged to be socially acceptable and socially un- 
desirable behaviors. 


Performance: awards, citations received; extra credit assign- 
ments and associated points earned; numbers of books or 
other learning materials taken out of the library; products 
exhibited at competitive events. 

Recommendations: numbers of and judged levels of favor- 
ableness. 


Recidivism by students: incidents (presence or absence or 
frequency of occurrence) of a given student’s returning to a 
probationary status, to a detention facility, or to observable 
behavior patterns judged to be socially undesirable (intoxi- 
cated state, dope addiction, hostile acts including arrests, 
sexual deviation). 

Referrals: by teacher to counselor, psychologist, or admin- 
istrator for disciplinary action, for special aid in overcoming 
learning difficulties, for behavior disorders, for health defects, 
or for part-time employment activities. 

Referrals: by student himself (presence, absence, or fre- 
quency). 

Service points: numbers earned. 

Skills: demonstration of new or increased competencies such 
as those found in physical education, crafts, homemaking, 
and the arts that are not measured in a highly valid fashion 
by available tests and scales. 

Social mobility: numbers of times student has moved from 
one neighborhood to another and/or frequency with which 
parents have changed jobs. 

Tape recordings: critical incidents contained and other ana- 
lyzable events amenable to classification and enumeration. 

Tardiness: frequency of. 

Transiency: incidents of. 

Transfers: numbers of students entering school from another 
school (horizontal move). 

Withdrawal: numbers of students withdrawing from school 
or from a special program (see Dropouts). 

IV. Indicators of Status or Change in Cognitive and Affective 
Behaviors of Teachers and Other School Personnel in Re- 
lation to the Evaluation of School Programs 


Articles: frequency and types of articles and written docu- 
ments prepared by teachers for publication or distribution. 

Attendance: frequency of, at professional meetings or at in- 
service training programs, institutes, summer schools, colleges 
and universities (for advanced training) from which inferences 
can be drawn regarding the professional person's desire to 
improve his competence. 

Elective offices: numbers and types of appointments held in 
professional and social organizations. 


Grade point average: earned in postgraduate courses. 


Load carried by teacher: teacher-pupil or counselor-pupil 
ratio. 


Mail: frequency of positive and negative statements in written 
correspondence about teachers, counselors, administrators, and 
other personnel. 


Memberships including elective positions held in professional 
and community organizations: frequency and duration of 
association. 


Model congruence index: determination of how well the 
actions of professional personnel in a program approximate 
certain operationally-stated judgmental criteria concerning 
the qualities of a meritorious program. 


Moonlighting: frequency of outside jobs and time spent in 
these activities by teachers or other school personnel. 


Nominations by peers, students, administrators or parents for 
outstanding service and/or professional competencies: fre- 
quency of. 

Rating scales and check lists (e.g., graphic rating scales or 
the semantic differential) of operationally-stated dimensions 
of teachers’ behaviors in the classroom or of administrators’ 
behaviors in the school setting from which observers may 
formulate inferences regarding changes of behavior that re- 
flect what are judged to be desirable gains in professional 
competence, skills, attitudes, adjustment, interests, and work 
efficiency; the perceptions of various members of the total 
school community (parents, teachers, administrators, coun- 
selors, students, and classified employees) of the behaviors of 
other members may also be obtained and compared. 


Records and reporting procedures practiced by administrators, 
counselors and teachers: judgments of adequacy by outside 
consultants. 

Termination: frequency of voluntary or involuntary resigna- 
tion or dismissals of school personnel. 

Transfers: frequency of requests of teachers to move from 
one school to another. 

V. Indicators of Community Behaviors in Relation to the Eval- 
uation of School Programs 

Alumni participation: numbers of visitations, extent of in- 
volvement in PTA activities, amount of support of a tangible 
(financial) or a service nature to a continuing school program 
or activity. 

Attendance at special school events, at meetings of the board 
of education, or at other group activities by parents: fre- 
quency of. 

Conferences of parent-teacher, parent-counselor, parent-ad- 
ministrator sought by parents: frequency of request. 

Conferences of the same type sought and initiated by school 
personnel: frequency of requests and record of appointments 
kept by parents. 

Interview responses amenable to classification and quantifi- 
cation. 

Letters (mail): frequency of requests for information, ma- 
terials, and servicing. 

Letters: frequency of praiseworthy or critical comments about 
school programs and services and about personnel participat- 
ing in them. 

Participant analysis of alumni: determination of locale of 
graduates, occupation, affiliation with particular institutions, 
or outside agencies. 

Parental response to letters and report cards upon written or 
oral request by school personnel: frequency of compliance by 
parents. 

Telephone calls from parents, alumni, and from personnel in 
communications media (e.g., newspaper reporters): frequency, 
duration, and quantifiable judgments about statements mon- 
itored from telephone conversations. 


Transportation requests: frequency of. 

PATTERN PREDICTION OF ACADEMIC SUCCESS 

CLIFFORD E. LUNNEBORG AND PATRICIA W. LUNNEBORG 
University of Washington 


USING patterns in prediction is more than simply taking into ac- 
count a number of different predictors. Ordinary multiple regres- 
sion techniques already provide the means for linearly weighting 
several measures in the prediction of some attribute or perform- 
ance. As typically employed, however, this technique falls short 
of utilizing pattern information because the weight assigned any 
particular variable remains fixed, independent of the level of other 
variables. For example, in predicting success in college mathemat- 
ics on the basis of high school mathematics grades and some test 
of quantitative aptitude, the contribution of a given grade average, 
say a “B,” is the same for students of low, medium, and high quan- 
titative aptitude. All this is not to say, however, that patterns cannot 
be adapted to the multiple regression solution. 
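
The point about fixed weights can be illustrated with a small numerical sketch. Everything in it is assumed for illustration only: the data are synthetic, the variable names (`grade`, `aptitude`, `success`) are hypothetical, and the product term is just one simple way of letting a weight vary with the level of another variable, not the method of this study.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
grade = rng.normal(size=n)     # hypothetical standardized high school mathematics GPA
aptitude = rng.normal(size=n)  # hypothetical standardized quantitative aptitude
# Synthetic criterion in which the payoff of grades depends on aptitude.
success = (0.4 * grade + 0.3 * aptitude + 0.5 * grade * aptitude
           + rng.normal(scale=0.5, size=n))


def fit(columns, y):
    """Least-squares weights for an intercept plus the given predictor columns."""
    X = np.column_stack([np.ones(len(y))] + columns)
    weights, *_ = np.linalg.lstsq(X, y, rcond=None)
    return weights


def grade_slope(weights, aptitude_level):
    """Predicted change in the criterion per unit of grade at one aptitude level."""
    slope = weights[1]
    if len(weights) > 3:  # a product term is present
        slope = slope + weights[3] * aptitude_level
    return slope


additive = fit([grade, aptitude], success)
configural = fit([grade, aptitude, grade * aptitude], success)

levels = (-1.0, 0.0, 1.0)  # low, medium, high aptitude
additive_slopes = [grade_slope(additive, a) for a in levels]
configural_slopes = [grade_slope(configural, a) for a in levels]
```

In the additive equation the three slopes are identical by construction, whereas in the equation with the product term the contribution of a given grade rises with aptitude.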

There are two traditions regarding the use of pattern data in pre- 
diction studies. The first might be called the classification approach 
and is exemplified by the configural scale of Lubin and Osburn 
(1957) and by the significant pattern method of Forehand and 
McQuitty (1959). This approach emphasizes the unique assign- 
ment of each individual to one pattern. Pattern information over a 
given set of variables is assumed to be best represented by finding 
the single, most congruent pattern for every person. This assign- 
ment then substitutes for the original variables as the basis for pre- 
diction. McQuitty's more recent work on typing and hierarchical 
typing (1966a, 1966b) used this strategy. Attempts to use the 
classification approach for prediction have invariably led to severe 
shrinkage in cross-validation (Forehand and McQuitty, 1959; Lee, 
1956). 

The second tradition stems from Horst’s suggestion (1954) that 
subpatterns of scores contain predictive variance. Studies belonging 
to this subpattern scoring approach include Alf (1956), Horst 
(1957), Lunneborg (1959), and Wainwright (1965). This ap- 
proach assumes the individual can be assigned a multitude of scores 
each based on a subpattern. These subpattern scores can then be 
used with the original variables as a basis for prediction. This 
second approach emphasizes the development of more effective 
ways of scoring a given set of variables. 
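
A minimal sketch of what subpattern scoring might look like follows. It assumes, purely for illustration, that each variable has already been reduced to one of three levels and that a subpattern score is simply 1 when a person's profile shows the subpattern and 0 otherwise; the function name `subpattern_scores` and the profile shown are hypothetical and are not taken from the studies cited.

```python
from itertools import combinations, product

LEVELS = ("low", "medium", "high")  # assumed coding of each variable


def subpattern_scores(profile, size):
    """Map every subpattern of `size` variables to a 0/1 score.

    `profile` maps a variable name to the person's level on it.
    A subpattern is a tuple of (variable, level) pairs and scores 1
    only when the profile shows every listed level.
    """
    scores = {}
    for variables in combinations(sorted(profile), size):
        for levels in product(LEVELS, repeat=size):
            pattern = tuple(zip(variables, levels))
            scores[pattern] = int(all(profile[v] == lv for v, lv in pattern))
    return scores


person = {"F1": "high", "F2": "low", "F3": "medium"}  # hypothetical profile
pair_scores = subpattern_scores(person, 2)
```

With three variables there are 27 two-variable subpatterns, and any one person shows exactly three of them, so each person receives several scores rather than a single pattern assignment.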

The present investigation represents an extension of this latter 
approach to the problem of estimating student success in each of a 
number of academic areas. Although the drop in validity when the 
classification approach is extended to a new sample may be a 
function of the approach per se, it might also be a function of 
utilizing too global a criterion (Forehand and McQuitty, 1959; 
Lee, 1956). Equally important, however, in the present study are 
the routines adopted for reducing the number of subpatterns from 
the total number dictated by all possible configurations. 

It is not only computationally infeasible to consider all possible 
patterns in a given study but the weights derived on a particular 
sample inevitably undergo the shrinkage typical in cross-validation 
when any large number of variables enters into multiple regression. 
Therefore, all attempts at the subpattern approach have included 
techniques for reducing the number of subpatterns studied. Alf 
(1956) relied on a single predictor selection analysis while Horst 
(1957) and Lunneborg (1959) introduced tests of the potential 
effectiveness of groups of these subpatterns. In the present study, 
an initial screening of patterns was undertaken to eliminate those 
not occurring reliably from sample to sample or occurring with but 
very limited frequency. Reliable patterns were then studied system- 
atically, by determining the increase in predictive efficiency of 
adding to the original variables first the two-variable patterns of 
scores, then three-variable patterns, and so on, with cross-valida- 
tion at each stage. 
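
The screening and cross-validation just described might be sketched as follows. This is only an illustration under stated assumptions: the thresholds `min_count` and `min_agreement`, the agreement index, and both function names are hypothetical, since the text does not give the actual screening rules or computing routines.

```python
import numpy as np


def screen_patterns(first, second, min_count=10, min_agreement=0.5):
    """Return column indices of patterns that are frequent and consistent.

    `first` and `second` are 0/1 arrays (persons x patterns) scored from
    two administrations. Both thresholds are illustrative assumptions.
    """
    first = np.asarray(first)
    second = np.asarray(second)
    both = (first * second).sum(axis=0)             # shown on both occasions
    either = np.maximum(first, second).sum(axis=0)  # shown on at least one
    agreement = np.divide(both, either,
                          out=np.zeros(both.shape, dtype=float),
                          where=either > 0)
    keep = (both >= min_count) & (agreement >= min_agreement)
    return np.flatnonzero(keep)


def cross_validated_r(X_fit, y_fit, X_new, y_new):
    """Fit least-squares weights in one sample; correlate predictions in another."""
    A_fit = np.column_stack([np.ones(len(y_fit)), X_fit])
    weights, *_ = np.linalg.lstsq(A_fit, y_fit, rcond=None)
    A_new = np.column_stack([np.ones(len(y_new)), X_new])
    return float(np.corrcoef(A_new @ weights, y_new)[0, 1])
```

One possible use is to call `cross_validated_r` first with the original variables alone, then with the screened two-variable pattern columns appended, then with three-variable patterns, and to compare the resulting correlations at each stage.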

Subjects. A randomly selected group of 3,000 high school sen- 
iors who had participated in the Washington Pre-College (WPC) 
Testing Program in January 1964 were administered a second bat- 
tery three months later. Criterion data were available on 179 men 
and 127 women who entered the University of Washington in 1964 
and 1965. 

Predictors and criteria. The first battery consisted of the WPC 
tests of: English usage, spelling, reading comprehension, mechani- 
cal knowledge, spatial relations, applied mathematics, vocabulary, 
mathematics achievement, and a three-part test of quantitative 
aptitude. The second battery administered consisted of: Guilford- 
Zimmerman Aptitude Survey Part I, verbal comprehension and Part 
VII, mechanical knowledge; the American Council on Education 
Psychological Exam, quantitative part only; Cooperative English 
Test, Form OM, Part I, English usage and Part II, spelling; and 
the College Entrance Examination Board achievement test in in- 
termediate mathematics, Form CAC 1. The purpose of this dual 
testing, incidentally, was to replace the second battery of older, 
commercially available tests with a unique set designed for the 
Program by Educational Testing Service. In addition to these tests, 
predictors included sex, age, and six high school GPA’s: English, 
mathematics, foreign language, social science, natural science, and 
electives. 

The criteria were the average grades over 1964-66 at the Uni- 
versity of Washington in four areas, English, mathematics, foreign 
language, and physical science, as well as the dichotomous criter- 
ion, satisfactory progress towards the bachelor’s degree, defined as 
86 or more credit hours per year and GPA no less than 2.00. 

Procedure. The batteries together with the nontest predictors 
were separately factor analyzed (Lunneborg, 1967). The first 
seven principal axis factors of each analysis were then matched 
according to the oblique maximum variance method (Horst, 
1965). These analyses were based on the entire group of 3,000 
students. Factor scores for the first five maximally congruent variables
in the two sets were then determined for those 306 subjects
for whom criterion data were available. These factor scores were 
computed in standard score form with a mean of 50 and standard 
deviation of 10 based on the group of 3,000. 

These factor matching scores were used in the absence of test-retest
data on either battery. Test-retest information was necessary
for the definition of consistent patterns. It has previously been
shown that these two test batteries have very congruent factor
structures (Lunneborg, 1967) and, consequently, that administering
both provides the necessary reliability data for factor
scores. However, because these matched factors have been defined
so as to be maximally congruent, rather than with any simple or
psychological structure in mind, it is not possible to describe them
simply in terms of certain battery tests, e.g., verbal aptitude, high
school achievement, general intellectual ability, etc. Rather, each
of the five factors was complexly related to all battery elements.
For this reason the factors are throughout merely identified as
Factors 1 through 5.

Each subject thus had five factor scores for each battery. These
factor scores were trichotomized, labeling as high, scores above
56.6 (fourth quartile in a normal population), as medium, scores
in the range 43.4 to 56.6, and as low, scores of 43.4 and below
(first quartile). A pattern was thus defined as any combination of
these trichotomized factor scores, for example, high Factor 1—
medium Factor 4, which occurred in any subject on both
administrations. The maximum number of patterns possible for any
individual based on five scores is 26,

$$\sum_{k=2}^{5} \frac{5!}{k!\,(5-k)!} = 26; \tag{1}$$

however, this maximum could only be reached if the two sets of five
trichotomized scores were identical.


There are 1,008 possible patternings of five, four, three, and
two trichotomized variables,

$$\sum_{k=2}^{5} 3^{k}\,\frac{5!}{k!\,(5-k)!} = 1{,}008. \tag{2}$$

In any sample not all of these 1,008 patternings may occur or some
of them may occur with such low frequency as to be of no practical
interest. In the current study the only patterns investigated were
those which occurred in more than 5 per cent or 15 of the subjects.
Subjects were randomly divided into two equal sized groups, one for
regression weight determination and the other for validation of
weights.
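
The pattern definition and the two counts above can be checked with a short sketch. It assumes that a pattern "occurs on both administrations" when every factor in it receives the same trichotomized level on both occasions; the function names are illustrative and do not come from the original study.

```python
from itertools import combinations
from math import comb

# Cut points quoted in the text for scores with mean 50 and SD 10.
HIGH_CUT, LOW_CUT = 56.6, 43.4


def trichotomize(score):
    """Label a factor score as high, medium, or low."""
    if score > HIGH_CUT:
        return "high"
    if score <= LOW_CUT:
        return "low"
    return "medium"


def consistent_patterns(first, second):
    """Patterns of two or more factors whose levels agree on both
    administrations for one subject (assumed reading of the text)."""
    levels_1 = [trichotomize(s) for s in first]
    levels_2 = [trichotomize(s) for s in second]
    stable = [i for i in range(len(first)) if levels_1[i] == levels_2[i]]
    patterns = []
    for size in range(2, len(stable) + 1):
        for combo in combinations(stable, size):
            # Factors are numbered from 1, as in the article.
            patterns.append(tuple((i + 1, levels_1[i]) for i in combo))
    return patterns


# Maximum number of patterns for one subject with five factors.
max_per_person = sum(comb(5, k) for k in range(2, 6))

# Number of distinct patternings of two to five trichotomized factors.
total_patternings = sum(comb(5, k) * 3 ** k for k in range(2, 6))

print(max_per_person, total_patternings)
```

A subject whose five levels agree on both administrations yields all 26 patterns; each factor that changes level removes every pattern containing it.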
only as long as the expected shrunken multiple correlation (R,) did
not drop. It was these correlations from cross-validation by which 
patterns were assessed as predictors. The following procedure was 
followed for each criterion: the first multiple regression analysis 
involved only the five factor scores from the first administration.
Then five predictor selections were conducted adding to the factor
scores the 50 two-variable patterns which had been found, ten at a
time. As many variables as were selected in each of these five se- 
quential analyses were submitted for a final one- and two-variable 
sequential analysis to produce the best set of predictor variables 
(with ten as the limit) for a given criterion. Following this stage, 
the 42 three-variable patterns which had been found were added 
to the best set of one- and two-variable predictors, fourteen at a 
time, making three sequential analyses. As many variables as 
were selected in each of these three analyses were submitted for a
final three-variable sequential analysis to produce the best set of 
predictor variables (with ten again as the limit) for a given cri- 
terion. Lastly, to this best set of one-, two-, and three-variable 
predictors were added the seven four-variable patterns for a final 
selection of the best ten predictors for each criterion. 

Results and Discussion. As mentioned above, in the sample of 
306 Ss there were 50 two-variable, 42 three-variable, and 7 four- 
variable patterns which occurred consistently and with a frequency 
of at least 15. Intercorrelations among these patterns, the five fac- 
tor scores and five criteria for the weighting sample formed the 
basis for all subsequent predictor selections. 

As seen from Table 1, in the predictor selections involving
only factor scores no more than three factors were selected for any
criterion. Equally noteworthy was the selection for all criteria of
Factor 3, giving it the appearance of a general factor. These results
were a function of the rotation of the battery factors to maximal 
congruence and doubtless a rotation to simple structure would not 
have produced the same selections. 

With respect to the selections involving patterns, there was con- 
siderable variation among the criteria in the extent to which factors 
as opposed to patterns were selected. There was also variation not 
obvious in Table 1 in the order in which the two were selected. 
For both English and mathematics two of the first three variables
selected were always factors. In contrast, only Factor 3 was
involved in the selections for foreign languages and satisfactory
progress although it was always first selected for the former and second
selected for the latter. For physical science, factors were never 
included within the first four predictors selected. The predictor 
selection techniques tended to insure that earlier selected variables 
received the greater weight. 

These differences among criteria in predictors selected and in 
orders of selection were reflected in the cross-validation results 
where the most general conclusion must be that patterns lack pre- 
dictive stability from sample to sample. Where more than one fac- 
tor was weighted and where factor weightings were relatively 
heavy, cross-validation correlations held up. But where patterns 
were more heavily weighted, the very high correlations in the 
weighting sample underwent marked shrinkage in cross-validation. 

The failure of pattern information to aid prediction confirms the 
negative results of earlier attempts to use patterns and is all the 
more poignant a failure because of the attention paid to the
selection of differentiated criteria and relevant predictors. There are
even well-developed ideas throughout the educational literature 
as to the different configurations of abilities intuited behind dif- 
ferent achievement criteria. In response to similar speculations in 
the counseling literature regarding patterns of personality needs 
associated with academic achievement, a study of reliable, fre- 
quent EPPS need patterns demonstrated the same lack of predictive 
stability (Lunneborg and Lunneborg, 1967). Given the content 
similarity in the present study between predictors and criteria, the 
use of only reliable patterns, and the refinement of criteria, there 
would seem to be small room for continuing the conjecture that 
patterns can go above and beyond prediction from simple linear 
functions of original variables. 


UNIQUENESS OF SELECTED EMPLOYMENT APTITUDE
TESTS TO A GENERAL ACADEMIC GUIDANCE
BATTERY¹


CLIFFORD E. LUNNEBORG AND PATRICIA W. LUNNEBORG
University of Washington 


The growth of the community college system places new and
heavy demands on college testing programs. Such programs must
now offer high school seniors predictions of success not only in
traditional academic course work but in vocational-technical study as
well. There are two initial approaches to providing this ex- 
panded guidance. First, traditional predictors must be revalidated 
in the community college setting, and second, experimental pre- 
dictors must be evaluated. 

To meet this latter need for new predictors most directly, the 
full range of promising tests would have to be administered together 
with the established battery to the total sample serviced by the 
battery. The largest possible number of subjects would be required 
for sufficient criterion data to be produced in many different areas 
of study in as short a time as two to three years. It is virtually im- 
possible for any program to achieve this ideal. The number of 
experimental tests is too great, the time to administer them too 
small. Further, what little experimental testing time is available 
must also be used to maintain the battery as it stands (alternate 
forms, timing changes, etc.). Because of the necessarily long wait 
for criterion data to validate an experimental test, the selection of 
such tests is critical, as 2-3 years may be wasted “validating” a
useless selection.

1 This investigation was supported by the Washington Pre-College Testing
Program, University of Washington, Seattle.

This study proposes a two-stage strategy for evaluating experimental
predictors so as to minimize the risk of wasted testing and
validation time. If an experimental predictor is to extend the
validity of a battery to more diverse criteria, it must demonstrate
uniqueness with respect to that battery. Because this characteristic 
of uniqueness can be tested independent of criterion data, this 
evaluation can precede the long and expensive task of collecting 
validational information. Experimental tests eliminated because 
of communality with the established test battery can immediately 
be replaced with other potential additions. Only for tests with dem- 
onstrated uniqueness need criterion data be collected. 

Equally imperative to this strategy, if it is to be realized quickly, 
is the simultaneous evaluation of many potential predictors. This 
can be accomplished by subdividing the testing population into
matched subgroups which receive different experimental materials 
with the established battery. In this study the established battery in- 
cluded measures of mechanical reasoning and spatial ability in ad- 
dition to verbal and quantitative aptitudes. Of interest was the pre- 
dictive potential of several employment aptitude tests selected for 
their seeming dissimilarity to these tests and for their similarity to 
several vocational criteria. 

The extent to which this particular traditional battery predicted 
vocational-technical criteria in community colleges has been recently 
established (Lunneborg and Lunneborg, 1967). Grades in auto 
mechanics, electronics, secretarial studies, welding, etc., were found 
as predictable as university academic areas using the traditional set 
of predictors which included different high school GPA’s. How- 
ever, because of the considerable contribution of mechanical rea- 
soning to prediction, it was concluded that there was room for im- 
provement from additional variables. This conclusion was all the 
more true because the traditional battery in question was designed 
for differential prediction. Thus, as the criteria are expanded and 
become more disparate, the best set of differential predictors will 
need to tap an even wider variety of skills and aptitudes. 

Subjects. High schools participating in the Washington Pre-Col- 
lege (WPC) Testing Program autumn 1966 were classified accord- 
ing to two standards, size of graduating class and proportion of 
seniors going to college (i.e., proportion participating in the
1965-66 WPC program), and were then divided into six groups matched
on these classifications. The different experimental materials to be 
evaluated were similarly divided into six experimental “batteries”
each 30 minutes long which were randomly assigned to the six high 
school groups. The subjects consisted of 22,500 seniors who com- 
pleted the regular battery and one of the experimental batteries 
in a single testing day. Total testing time was six hours. The sample 
contained 49 per cent females and had a mean age of 17. 

Established test battery. WPC tests include vocabulary, English
usage, spelling, reading speed, reading comprehension, quantitative 
skills, applied mathematics, mathematics achievement, spatial abil- 
ity, and mechanical reasoning. Other variables used in academic 
prediction include sex, age, and six high school GPA’s: English, 
mathematics, foreign languages, natural science, social studies, and 
electives. 

Experimental employment aptitude tests. A biographic survey,
three subtests of the U. S. Employment Service General Aptitude
Test Battery (GATB), and four subtests of the Employee Aptitude
Survey (EAS) made up the experimental materials. The selected
GATB tests have validity for course grades in airplane mechanics,
baking, and bricklaying (U. S. Department of Labor, 1962), while 
the four EAS subtests were all highly correlated with grades in ma- 
chine shop, drafting, and electronics technology (Ruch and Ruch, 
1963). The GATB tests from B-1002 Form B consisted of Part 1, 
Name Comparison; Part 5, Tool Matching; and Part 7, Form 
Matching. The EAS subtests were: Test 2, Numerical Ability; 
Test 5, Space Visualization; Test 7, Verbal Reasoning; and Test 
10, Symbolic Reasoning. Each of these tests appeared in two ex- 
perimental batteries as follows: 

Experimental Battery A. Biographic Survey (20 minutes) and
Name Comparison (10 minutes).

Experimental Battery B. Verbal Reasoning (10 minutes),
Numerical Ability (10 minutes), and Symbolic Reasoning (10
minutes).

Experimental Battery C. Tool Matching (10 minutes), Name
Comparison (10 minutes), and Space Visualization (10 minutes).

Experimental Battery D. Biographic Survey (20 minutes) and
Numerical Ability (10 minutes).

Experimental Battery E. Form Matching (10 minutes), Tool
Matching (10 minutes), and Verbal Reasoning (10 minutes).

Experimental Battery F. Form Matching (10 minutes), Space
Visualization (10 minutes), and Symbolic Reasoning (10 minutes).

Procedure. Intercorrelations among all variables were computed 
and formed the basis for a series of sequential predictor selections 
(Horst and Smith, 1950) utilizing the WPC tests as predictors and 
each of the experimental subtests in turn as a criterion. Predictors 
were selected until the shrunken multiple correlation dropped. 
Regression constants and the reliability coefficients for all tests 
were used to calculate the uniqueness of each of the experimental 
tests (Horst, 1966, p. 330). 

Results. Table 1 presents the correlations of the established WPC 
battery elements with the experimental tests, and Table 2 pre- 
sents the results of the predictor selections, one for each experimen- 
tal test, to determine the extent to which the variance in each could 
be accounted for by WPC tests. The square of the multiple
correlation (R²) indicates the proportion of variance of the
experimental tests accounted for by the WPC tests, but 1 − R² tends to
overestimate the uniqueness of the experimental test because it
ignores test unreliability. Horst’s uniqueness statistic (1966) also
reported in Table 2 takes reliabilities into account in estimating 
the per cent of reliable variance in each experimental test not 
accounted for by the traditional battery. Despite the known in- 
dividual validities of all these subtests, the most important initial 
consideration is whether they measure something unique to the 
established battery. 
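
The effect of unreliability on apparent uniqueness can be illustrated with a simplified calculation. This is not Horst's statistic, which also uses the reliabilities of the predictors. The sketch corrects only for the reliability of the experimental test, and its function name and numeric values are hypothetical.

```python
def unique_reliable_variance(r_squared, reliability):
    """Proportion of a test's reliable variance that the battery leaves
    unexplained, correcting only for the test's own unreliability.

    Simplifying assumption: predictor unreliability is ignored, so this
    is a rough illustration rather than Horst's (1966) statistic.
    """
    if not 0.0 < reliability <= 1.0:
        raise ValueError("reliability must lie in (0, 1]")
    if not 0.0 <= r_squared <= 1.0:
        raise ValueError("r_squared must lie in [0, 1]")
    # Explained share of the reliable variance, capped at 1.
    explained = min(r_squared / reliability, 1.0)
    return 1.0 - explained


# Hypothetical values, chosen only for illustration.
r_squared, reliability = 0.60, 0.80
naive_uniqueness = 1.0 - r_squared
corrected_uniqueness = unique_reliable_variance(r_squared, reliability)
print(round(naive_uniqueness, 2), round(corrected_uniqueness, 2))
```

With these made-up figures the uncorrected value is .40 and the corrected value is .25, which shows why the raw unexplained proportion makes a test look more unique than it is: part of that proportion is only measurement error.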

Overall, GATB subtests were superior to EAS subtests. EAS 2,
numerical ability, which measures skill in addition, subtraction,
multiplication and division, was essentially accounted for by the
WPC tests with only 18 per cent unique variance. The mean of
39.1 was consistent with those for college males (45.2) and college
females (38.2) reported by Ruch and Ruch (1963). Nonetheless,
EAS 2 could be a potentially useful addition to the established
battery not as a complement but as a replacement for, for example,
WPC applied mathematics (simple arithmetic and algebra) if EAS
2 were more desirable on other grounds such as requiring less
testing time or yielding a better distribution of scores.

The remaining EAS tests, although less predictable than
numerical ability, still had close to 50 per cent of their reliable
variance determined by the current battery and had substantial
correlations with one or more of the WPC tests. Any further con- 
sideration of these three experimental tests would require modi- 
fying the current battery. Thus, for example, the high correlation 
of EAS space visualization with WPC spatial ability indicates EAS 
5 does not represent a different space factor, and including both 
space tests in the battery would be a waste of testing time. 

The most promising experimental tests for extending the factor- 
ial complexity of the established battery were GATB tool matching 
and form matching with 88 per cent and 83 per cent uniqueness re- 
spectively. GATB 1, name comparison, on the other hand was no 
more unique than the EAS tests. GATB tool and form matching 
were also relatively independent of each other (r = .30). 

Summary. The impetus for this investigation was the need to
expand a general academic guidance battery established for university
students so that it would be of service to community college students 
as well. Community college curricula not only expand the criteria 
of concern but also increase their diversity. Effective prediction 
requires that this diversity of criteria be matched by diversity of 
predictors. As this study illustrated, the search for new predictors 
in this situation can be effectively divided into two stages. Unique- 
ness of potential predictors is first evaluated. Thus, criterion data 
need only be accumulated for those measures which pass this first 
test of uniqueness, and those potential predictors which fail the 
test can immediately be replaced in experimental testing. 

Determining the uniqueness of a test requires only a fraction of 
the sample needed to establish the validity of that test for a wide 
range of criteria. Thus, quite efficient use of available testing time 
can be made by subdividing the testing sample into matched sub- 
groups. Because determining uniqueness requires only that each 
potential predictor be administered with the established battery, 
experimental tests can be quite different in the several subgroups. 
While the number of such groups will depend primarily on sample 
size, a number of experimental predictors can always be evaluated 
simultaneously. 

In the present study uniqueness clearly differentiated two em- 
ployment aptitude tests, GATB tool and form matching, for which 
criterion data are worth accumulating based on the potential new 
variance these tests bring to the established battery. Future
experimental testing will continue to assess the uniqueness of
remaining potential additions to the battery.


SOCIAL INTELLIGENCE AND ACADEMIC SUCCESS


MARY L. TENOPYR¹
North American Rockwell Corporation
El Segundo, California


The purposes of the investigator were twofold: (a) to obtain
further information relative to the construct validity of some of
O'Sullivan, Guilford and de Mille's (1965) tests of behavioral
cognition or social intelligence, and (b) to ascertain whether grades in
selected ninth-grade subjects could be predicted more effectively
were these tests employed in addition to the School and College
Ability Tests (SCAT) (Educational Testing Service, 1957a) and
some of the Sequential Tests of Educational Progress (STEP)
(Educational Testing Service, 1957b) already utilized by a suburban
high school in a large metropolitan area. 

Method. Subjects (Ss) were 115 male and 151 female tenth 
graders who were selected primarily on the basis of availability of 
SCAT-STEP scores which were in the school’s records only for 
students who had attended a school in the city the previous year. 
The Ss’ standard deviations for the SCAT-STEP series were
approximately equal to those reported for urban ninth graders
(Educational Testing Service, 1963); however, their means for these
tests were generally about .5s higher than those of the appropriate 
norm group. 

Dependent variables comprised (a) freshman grade-point average
(GPA), (b) world history grade, available for 219 Ss assigned
to the course on the basis of an achievement test, and (c)
grade in the regular ninth-grade English class, to which 173 Ss had
been assigned as a result of teachers’ ratings.

1 The author is indebted for administrative assistance provided by the
Aptitudes Research Project under the direction of Professor J. P. Guilford of
the University of Southern California. Deep appreciation is expressed for
support of the study by Mr. Francis Laird, Dr. Webster Wilson, and Mr. Robert
Brown of the Fullerton, California schools. Computational assistance was
provided by the Western Data Processing Center.

Each of the five behavioral cognition tests employed as an in- 
dependent variable was developed on the basis of the structure- 
of-intellect (SI) model (Guilford and Hoepfner, 1966). Each of 
these tests was found by O'Sullivan, Guilford, and de Mille (1965) 
to measure a separate respective ability, distinct from those in the 
verbal intellectual areas. In a second factor analysis involving these 
tests (Tenopyr, Guilford, and Hoepfner, 1966) they were found to 
define factors other than those in the symbolic-cognition and the 
symbolic-memory areas. Further information regarding the construct
validity of the behavioral cognition tests is necessary before
they can be applied in practical situations. Should they be found 
to correlate highly with academic success, there would be some 
reason to believe that they represent constructs in the intellectual 
area; however, should they not be strongly related to school 
achievement, there would be further support for the contention that 
a heretofore untapped area of behavioral activity is being measured.
The tests of social intelligence were as follows: Missing Cartoons,
Picture Exchange, Picture Exclusion, Reflections, and Silhouette
Relations. In Tenopyr, Guilford, and Hoepfner’s (1966)
analysis involving the present Ss these tests were found to be univo- 
cal measures of, in SI terms, cognition of behavioral systems, trans- 
formations, classes, implications, and relations, respectively. The 
internal consistency estimates of reliability for these tests ranged 
from .32 to .82. 

The published tests used were SCAT—Verbal, SCAT—Quantitative,
STEP—Mathematics, STEP—Reading, and STEP—Writing.

Nine multiple regression analyses, one for each of the dependent 
variables coupled with each of the following combinations of inde- 
pendent variables: all tests, behavioral cognition tests only, and 
SCAT-STEP series only, were accomplished. 

Results. Both intercorrelations and individual correlations with
the criteria were lower for the behavioral cognition tests than for
the SCAT-STEP series (Tables 1, 2, and 3). However, as is
indicated in Table 4, the lower test intercorrelations for the
behavioral tests did not effect levels of prediction commensurate with
those afforded by the more highly related SCAT-STEP variables.


TABLE 1

Coefficients of Correlation, Means, and Standard Deviations
for All Subjects (N = 266)

Variable                     1   2   3   4   5   6   7   8   9  10  11        M       s
 1. Missing Cartoons            39  32  39  32  46  37  37  51  44  29    18.90    5.00
 2. Picture Exchange                26  24  20  31  31  33  33  20  27     9.56    2.47
 3. Picture Exclusion                   16  22  34  25  21  35  35  33    12.08    3.16
 4. Reflections                             09  38  34  31  42  38  23     9.78    2.73
 5. Silhouette Relations                        30  20  17  26  27  29    14.14    3.02
 6. SCAT—Verbal                                     55  58  81  68  54   280.10   14.80
 7. SCAT—Quantitative                                   66  59  51  62   298.10   14.35
 8. STEP—Mathematics                                        61  43  54   275.05   13.60
 9. STEP—Reading                                                70  61   293.20   16.05
10. STEP—Writing                                                    52   284.90   17.10
11. GPA                                                                    2.43     .60

Note.—Decimals omitted from correlations only. At .05 and .01 levels,
significant r's are .12 and .16, respectively.


All Rs computed were highly significant (p < .001). Although
the increases in Rs afforded by adding, as predictors, the behavioral
cognition tests to the SCAT-STEP series were small in magnitude,
the increases were significant (p < .01) in predicting GPA and
world history grade.
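
The article does not say which test was applied to the increases in R. One conventional choice, assumed here, is the F ratio for nested regression models. The sketch below uses the English-grade entries of Table 4 as printed (.63 for all tests, .60 for the SCAT-STEP series), with ten and five predictors respectively; the function name is illustrative.

```python
def f_for_increase_in_r(r_full, r_reduced, n, k_full, k_reduced):
    """F ratio for the gain in squared multiple correlation when
    predictors are added to a regression equation.

    Standard nested-model formula, assumed rather than taken from the
    article: F = [(R_f^2 - R_r^2) / (k_f - k_r)] / [(1 - R_f^2) / (n - k_f - 1)].
    """
    df_num = k_full - k_reduced
    df_den = n - k_full - 1
    if df_num <= 0 or df_den <= 0:
        raise ValueError("degrees of freedom must be positive")
    gain = r_full ** 2 - r_reduced ** 2
    f_ratio = (gain / df_num) / ((1.0 - r_full ** 2) / df_den)
    return f_ratio, df_num, df_den


# English grade: ten predictors (all tests) against five (SCAT-STEP only).
f_ratio, df_num, df_den = f_for_increase_in_r(0.63, 0.60, 173, 10, 5)
print(round(f_ratio, 2), df_num, df_den)  # F is roughly 1.98 on 5 and 162 df
```

The resulting F can be compared with tabled critical values for the stated degrees of freedom.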


TABLE 2

Coefficients of Correlation, Means, and Standard Deviations for Subjects Having
Taken Regular World History Course (N = 219)

Variable                     1   2   3   4   5   6   7   8   9  10  11        M       s
 1. Missing Cartoons            35  27  35  33  39  27  29  45  39  27    18.83    4.88
 2. Picture Exchange                27  22  14  23  25  28  26  23  21     9.53    2.40
 3. Picture Exclusion                   13  19  32  18  15  30  35  33    12.72    3.16
 4. Reflections                             11  36  24  26  38  36  17     9.69    2.68
 5. Silhouette Relations                        24  17  07  23  26  33    14.09    2.88
 6. SCAT—Verbal                                     47  50  77  64  51   278.70   13.
 7. SCAT—Quantitative                                   59  48  47  44   297.40   13.30
 8. STEP—Mathematics                                        50  36  41   274.60   12.40
 9. STEP—Reading                                                68  51   292.35   14.65
10. STEP—Writing                                                    50   283.80   15.25
11. World History Grade                                                    2.48     .90

Note.—Decimals omitted from correlations only. At .05 and .01 levels,
significant r's are .13 and .18, respectively.


TABLE 3

Coefficients of Correlation, Means, and Standard Deviations for Subjects Having
Taken Regular English Course (N = 173)

Variable                     1   2   3   4   5   6   7   8   9  10  11        M       s
 1. Missing Cartoons            89  31  32  30  35  23  22  38  36  28    19.12    4.82
 2. Picture Exchange                31  30  17  23  18  31  24  19  14     9.52    2.24
 3. Picture Exclusion                   19  20  43  18  16  36  39  32    12.58    3.20
 4. Reflections                             06  32  23  21  35  32  10     9.98    2.62
 5. Silhouette Relations                        14  12  02  15  21  29    14.14    2.85
 6. SCAT—Verbal                                     41  43  74  61  45   280.40   12.75
 7. SCAT—Quantitative                                   54  41  36  41   298.15   13.15
 8. STEP—Mathematics                                        45  27  28   276.05   11.45
 9. STEP—Reading                                                60  50   294.30   13.95
10. STEP—Writing                                                    52   285.20   15.40
11. English Grade                                                          2.34     .74

Note.—Decimals omitted from correlations only. At .05 and .01 levels,
significant r's are .15 and .20, respectively.


Discussion. Although the social intelligence tests, in optimal
combination, yielded a moderate level of prediction of academic
success, the fact that they could add relatively little to the
criterion-related variance provided by the SCAT-STEP series suggests that
the primary sources of academic achievement-relevant variance in
the behavioral tests were in already defined intellectual areas. A
further inference is that substantial proportions of the variances of
the social intelligence tests may be attributed to abilities other than
those typically associated with intellectual achievement. Although
this study has, by indicating to an extent what the behavioral
cognition tests do not measure, provided some further evidence relative
to the construct validity of these tests, additional effort should be
directed toward strengthening the empirical support for the
contention that the social intelligence tests measure the constructs
for which they were designed.


TABLE 4

Multiple Correlations for Tenth-Grade Students

                                                    Criterion
                                                  World History     English
                                       GPA            Grade          Grade
Independent Variable                (N = 266)       (N = 219)      (N = 173)
All tests                          [illegible]     [illegible]        .63
Behavioral-cognition tests only    [illegible]     [illegible]    [illegible]
SCAT-STEP series only                  .70             .60            .60


USEFULNESS OF PEER RATINGS OF PERSONALITY
IN EDUCATIONAL RESEARCH


GENE M. SMITH¹
Harvard Medical School and Boston University


For more than 30 years researchers have sought to clarify the
relationship between nonintellective factors and academic success.
The problem is of basic theoretical and methodological significance
and has national practical importance, but progress toward its
solution has been slow (Fishman and Pasanella, 1960; Gaier and
White, 1965; Garrett, 1949; Harris, 1940; Michael, 1965; Stagner, 
1933; Travers, 1949). Most psychologists and educators would 
agree that poor motivation, faulty attitudes, and other nonintel- 
lective problems contribute importantly to academic failure. Yet 
evidence to support this agreement is elusive and based largely on 
consideration of individual cases. 

If the relationship between nonintellective factors and academic 
success is to be elucidated further, it appears that new approaches 
must be tried. The results presented here, and in a concurrently 
published report (Smith, 1968), indicate that peer ratings of per- 
sonality (an approach to personality assessment not often used by 
psychologists) can be remarkably helpful in clarifying the relations 
between personality and academic success. Another concurrently 
published report (Smith, 1967), dealing with an entirely different 
problem, gives further evidence of the usefulness of peer rating
data for analysis and solution of theoretical and empirical
problems.

1 This study was supported in part by grants #NU00104 and USPH-M-987
from the United States Public Health Service, in part by a grant from the
Council for Tobacco Research-U.S.A., and in part by a grant from the
Spaulding-Potter Charitable Trusts. Richard LaBrie, our computer programmer,
contributed importantly to this work. Frederick Mosteller, Professor of
Mathematical Statistics, Harvard University, and Arthur P. Dempster,
Professor of Statistics, Harvard University, provided the author helpful guidance
and this is gratefully acknowledged.

Although several studies (e.g., Astington, 1960; Carroll, 1952; 
Doll, 1963; Flyer, 1963; Flyer and Bigbee, 1954; Kleiger et al.,
1962; Tupes, 1957) have shown peer ratings of personality to have 
good reliability and predictive validity, this method of studying per- 
sonality has never been widely appreciated—perhaps, in part, be- 
cause of biases which can reduce the validity of rating data (Guil- 
ford, 1954; Secord, 1958; and Guilford et al., 1962) and, in part, 
because of the difficulty of achieving test conditions required for the 
valid use of the peer rating method. 

This paper (a) discusses the theoretical rationale underlying the 
peer rating method, (b) discusses test conditions needed to avoid 
methodological problems frequently encountered in use of rating 
data, (c) reports results of reliability and factor analytic studies 
performed on peer rating data obtained in samples from college,
nursing school, and high school, and (d) presents results
concerning relations between personality and academic performance in a
sample of 348 college students.

The main justification for using the peer rating technique is its 
excellent analytic power. However, because the technique is not 
widely used, the logical basis for its power merits discussion. 
Human beings spend much time observing, evaluating, and drawing 
conclusions about other human beings. Even the most transient 
interaction between two individuals tends to influence to some
degree, and at some level of consciousness, the feelings, opinions, 
and conclusions each has regarding the other. Such feelings and 
conclusions tend to become less tentative and more correct as 
additional relevant information is accumulated, checked, and 
tested for internal consistency. Validity tends to increase with 
the insight and astuteness of the observer and also as the inter- 
actions which give rise to mutual observations increase in frequency, 
duration, and environmental diversity. 

Assessment of personality by use of information accumulated by 
the individual's peers—information concerning his past reactions 
to various circumstances, situations, and people in his environ- 
ment—has three distinct advantages. (1) The information used is 
generated in the non-test context of the individual's real-life
environment. (2) It is accumulated over long periods of time rather
than during one particular test period which might not have evoked
representative information. (3) It is accumulated and stored by 
numerous observers with whom the individual has differing personal 
relationships, and who, consequently, view him from differing per- 
spectives. 

Method—The peer rating technique used in this study. The rater 
examines each of 42 bipolar personality traits and selects the five 
members of his peer group most like the left hand pole and the five 
most like its opposite on the right. (See example in Figure 1.) Se- 
lections on the left are considered positive nominations and those 
on the right are considered negative. The positive and negative 
nominations a ratee receives are scored +1 and —1, respectively. 
Failure to be nominated is scored zero. Thus, in a peer group of 
25, a ratee’s score on any trait can range from +24 to —24. 
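
The scoring rule just described can be sketched in a few lines of Python. This is only an illustration, not the scoring program used in the study; the function name `score_trait`, the ballot layout, and the example ballots are assumptions introduced here.

```python
from collections import Counter


def score_trait(ballots):
    """Net nomination score per ratee for one bipolar trait.

    `ballots` holds one (left_picks, right_picks) pair per rater.
    A left-pole (positive) nomination counts +1, a right-pole
    (negative) nomination counts -1, and a ratee who is never
    nominated keeps a score of zero.
    """
    scores = Counter()
    for left_picks, right_picks in ballots:
        for ratee in left_picks:
            scores[ratee] += 1
        for ratee in right_picks:
            scores[ratee] -= 1
    return scores


# Hypothetical ballots from three raters (ratee identification numbers 1-12).
ballots = [
    ([1, 2, 3, 4, 5], [8, 9, 10, 11, 12]),
    ([1, 2, 3, 6, 7], [5, 9, 10, 11, 12]),
    ([2, 3, 4, 6, 7], [1, 8, 10, 11, 12]),
]
scores = score_trait(ballots)
print(scores[2], scores[10], scores[1], scores[5])
```

With 25 group members, each ratee can be nominated by at most 24 others, which gives the range of +24 to −24 stated above.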

Three features of this method should be emphasized: (1) For
each trait a rater must nominate five peers for the left pole and five 
for the right; if he does not comply at least 39 times out of 42, his 
ratings are discarded. (2) Since a ratee is rated on each of 42 
traits by each of 20-30 peers, he receives about 1,000 separate 


[Figure 1. Sample card used for peer ratings of personality.

Trait 5, left pole. QUITTING: gives up before he has thoroughly finished a job; slipshod; works in fits and starts; easily distracted, led away from main purposes by stray impulses or external difficulties.

Trait 5, right pole. DETERMINED, PERSEVERING: sees a job through in spite of difficulties or temptations; strong-willed; painstaking and thorough; sticks at anything until he achieves his goal.

(Answer grid for marking identification numbers omitted.)

RATING INSTRUCTIONS:
1. On left, mark the identification numbers of the 5 most QUITTING students in your group.
2. On right, mark the identification numbers of the 5 most DETERMINED, PERSEVERING students in your group.]
ratings from peers with whom he has a high frequency, intensity,
and duration of social interaction. (3) The 42 traits on which a
peer is rated are the product of extensive work: Allport and Odbert
(1936) compiled all words in the English dictionary (about 18,000)
which might be used to describe personality; Cattell (1957), by a 
series of logical and empirical distillations, reduced these traits to 
42 rating items. 

Guilford (1954) emphasizes the dangers inherent in rating techniques:
Error can arise from (1) individual rater response tendencies
(e.g., errors of leniency, neutrality, overseverity, contrast, and
similarity), (2) the unique relationship between the rater and the 
ratee (e.g., errors of halo and irrelevance), (3) inadequate knowl- 
edge of the ratee by the rater, and (4) faulty construction of the 
items to be rated. 

Appropriate planning can reduce the effects of difficulties such 
as those just mentioned. The errors of leniency, neutrality, over- 
severity, contrast, and similarity, which arise when the rater can 
assign the various positions on a rating scale without restriction, 
reflect the rater’s response preference. Use of a forced choice procedure
allows the rater no response preference.2 The rating instructions
used to collect the data reported here end by saying:


“On each card you must fill in five numbers on the left and five on
the right. This will be difficult in some cases. For instance, in the
case of ‘adaptable’ vs. ‘rigid’, you might feel that most members
of your section are adaptable and only two or three are
rigid. You still have to fill in five numbers under ‘rigid,’ and
you would do that by selecting the five you think are least
adaptable.”


Invalid ratings due to errors of halo and irrelevance are due to 
the rater's subjective bias. Unless many raters have the same bias, 
use of many raters should tend to cause these individual biases to 
cancel. (The present study of college students employed 20-30 
raters per group, a number which exceeds that usually found in 
studies employing rating scales.) 

2 It is important to note that forced choice among ratees is conceptually
[illegible] from forced choice between traits [illegible] to be
[illegible] social desirability. There is evidence that with the latter method,
[illegible] type under discussion can occur (Lovell and Haner, 1955; Travers,
1951).
Rating data have been criticized as undependable because of insufficient
knowledge of ratees by raters. This criticism usually arises
when one or a few superiors rate subordinates whom they observe 
playing one role in one environmental setting. In the present study, 
the subjects were peers; there were many raters of each peer; and 
the subjects interacted frequently, over long periods, and under 
diverse environmental circumstances. 

Regarding the question of faulty construction of items to be 
rated, three points should be made. (a) As already stated, the de- 
rivation of the 42 items is based on considerable work by All- 
port and Odbert (1936) and by Cattell (1957). (b) The re- 
liability and predictive validity of these items is high. (c) Most of 
these items appear to meet the exacting standards specified by 
Guilford (1954) and originally published by Champney (1941): 
clarity, relevance, precision, variety, objectivity, and uniqueness. 
(See listing of 42 items; Cattell, 1957, pp. 813-817.) 

Length of acquaintance in the college sample. In studies where 
peer ratings of personality are collected after students in a peer 
group have interacted for a long period of time, it is possible for 
correlations between academic performance and peer ratings of 
personality to be biased by a tendency for academic performance 
to influence peer ratings. To reduce the effect of this potential bias 
in the study of the college sample reported here, peer ratings were 
collected before the 348 college students received results of their 
first midterm examinations. (At that point they had been together 
only nine weeks.) 

Other characteristics of the college sample. The 348 college 
students entered the College of Basic Studies of Boston University 
(CBS) in 1964.3 CBS is a two-year lower division program which 
employs team teaching and a core curriculum. An entering class 
of approximately 550 freshmen is divided into 20 sections of 25-30 
students each, four sections of which are assigned to a team of five 
instructors who represent the five divisions which make up the core 
curriculum of the College—Humanities, Science, Social Science, 


3 235 of the 583 students in this class were excluded from analyses reported
in Tables 2-4 because they lacked scores on the criterion (year 1 grade point 
average) or on one or more of the variables numbered 48-77 in Table 2. All 
583 had scores on the peer rating variables and thus were used in the factor 
analysis of data of college students shown in Table 1. 
TABLE 2

Univariate Correlation Between Each of 77 Predictor Variables and
Year One Grade Point Average

Variable                                      r

42 Original Peer Variables
 1. Adaptable                               +.12
 2. Emotional                               +.01
 3. Conscientious                           +.32
 4. Prone to Jealousy                       −.08
 5. Quitting                                −.47
 6. Tender                                   .00
 7. Self-Effacing                            .00
 8. Languid                                 −.26
 9. Assertive                               [illegible]
10. Attention-Seeking                       [illegible]
11. Cool, Reserved                          [illegible]
12. Gregarious                              [illegible]
13. Demanding                               [illegible]
14. Quiet, Composed                         [illegible]
15. Cautious                                [illegible]
16. Self-Reliant                            +.26
17. Happy-Go-Lucky                          [illegible]
18. Responsible                             +.29
19. Tense                                   [illegible]
20. Trusting                                [illegible]
21. Gay                                     [illegible]
22. Good-Natured                            [illegible]
23. Insistently Orderly                     +.28
24. Tolerant of Stress                      [illegible]
25. Obstructive                             [illegible]
26. Crude                                   [illegible]
27. Easily Upset                            [illegible]
28. Sensitively Imaginative                 [illegible]
29. Esthetically Fastidious                 [illegible]
30. Frank, Expressive                       [illegible]
31. Talkative                               [illegible]
32. Conventional                            [illegible]
33. Considerate                             [illegible]
34. Socially Mature                         +.25
35. Thoughtful, Original                    [illegible]
36. Resourceful                             +.24
37. [illegible]                             [illegible]
38. Inquisitive                             +.35
39. Reluctant to Admit Mistakes             [illegible]
40. Prone to Daydream                       −.32
41. Obedient                                [illegible]
42. Marked Interest in Opposite Sex         [illegible]

Five Composite Peer Variables
43. Agreeable                               +.01
44. Extraversion                            +.05
45. Strength of Character                   +.43
46. Emotionality                            +.01
47. Refinement                              +.18

Two High School Performance Variables
48. Rank                                    +.19
49. Certified Units                         +.18

13 Academic Aptitude Variables
50. SAT-Verbal                              +.22
51. SAT-Mathematical                        +.04
52. DAT-Verbal Reasoning                    +.11
53. DAT-Numerical Reasoning                 −.01
54. DAT-Abstract Reasoning                  +.03
55. DAT-Space Relations                      .00
56. DAT-Mechanical Reasoning                −.02
57. Coop. Eng. Vocabulary                   +.22
58. Coop. Eng. Reading Level                +.23
59. Coop. Eng. Reading Speed                +.11
60. Coop. Eng. Expression                   +.14
61. STEP-Writing                            +.25
62. Watson-Glaser                           [illegible]

15 Edwards Variables
63. Achievement                             +.03
64. Deference                               [illegible]
65. Order                                   −.05
66. Exhibition                              [illegible]
67. Autonomy                                +.04
68. Affiliation                             −.06
69. Intraception                             .00
70. [illegible]                             [illegible]
71. [illegible]                             [illegible]
72. [illegible]                             [illegible]
73. Nurturance                              [illegible]
74. Change                                  +.01
75. Endurance                               [illegible]
76. Heterosexuality                         [illegible]
77. [illegible]                             [illegible]

Note. A value of −.02 appears beside rows 42 and 76 but cannot be assigned to either with certainty.
Rhetoric, and Psychology and Guidance. CBS admits applicants
who (because of poor high school records, shortages in prere- 
quisites, low academic aptitude scores, or some combination) are 
denied admission into four-year programs in Boston University. 
The CBS program is described in further detail by Anthony et al., 
1956; LaFauci, 1956; and LaFauci and Richter, 1965. 

The joint use of team system and core curriculum causes CBS 
students in sections of 25-30 students to have a frequency, inten- 
sity, and length of social interaction rarely found in other collegiate 
programs. Although these features make CBS a very suitable pop- 
ulation in which to use the peer method, many other academic set- 
tings also provide suitable samples; e.g., high schools where sec- 
tioning by curriculum, and by ability within curriculum, results in 
students in groups of 20-30 having many classes together. 

Results—Reliability of peer ratings of personality. Data collected 
at CBS, one high school, and two nursing schools, have permitted 
the study of split-half reliability of peer ratings on Cattell’s 42 bi- 
polar behavior traits. Figure 2 shows, for each of 61 samples, the 
median of the 42 split-half reliability coefficients (corrected with the 
Spearman-Brown formula) plotted against the number of raters in
that sample. (Variation among samples regarding number of raters
was due partly to variation in size of peer groups and partly to variation
among groups regarding absenteeism on the test day and regarding
compliance with rating instructions.) Note that reliability varies
as a function of number of raters, increasing as raters increase from 
six to about twenty. After twenty, increased reliability is slight. 
The median of the median reliability coefficients obtained with 
samples of 15 or more raters was 0.83; see Figure 2. 
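
A split-half coefficient corrected with the Spearman-Brown formula can be sketched as follows. The source does not say how the halves were formed; splitting the raters at random into two halves is an assumption made here, as are the function names and the simulated ballots, which stand in for real nomination data.

```python
import numpy as np


def spearman_brown(r_half):
    """Estimate full-length reliability from a half-length correlation."""
    return 2.0 * r_half / (1.0 + r_half)


def forced_choice(perceived, k=5):
    """One rater's ballot: +1 for the k highest, -1 for the k lowest, else 0."""
    ballot = np.zeros(perceived.shape[0], dtype=int)
    order = np.argsort(perceived)
    ballot[order[-k:]] = 1
    ballot[order[:k]] = -1
    return ballot


def split_half_reliability(ballots, rng):
    """ballots: (n_raters, n_ratees) array of +1/0/-1 nominations on one trait."""
    n_raters = ballots.shape[0]
    order = rng.permutation(n_raters)  # assumed: random split of the raters
    half_a = ballots[order[: n_raters // 2]].sum(axis=0)
    half_b = ballots[order[n_raters // 2:]].sum(axis=0)
    r_half = np.corrcoef(half_a, half_b)[0, 1]
    return spearman_brown(r_half)


# Simulated group: 20 raters, 25 ratees, each rater perceiving the
# ratees' standing on the trait with some random error.
rng = np.random.default_rng(0)
true_trait = rng.normal(size=25)
ballots = np.array([
    forced_choice(true_trait + rng.normal(scale=1.0, size=25))
    for _ in range(20)
])
reliability = split_half_reliability(ballots, rng)
print(round(spearman_brown(0.5), 3))  # 0.667
```

Because each half contains only half of the raters, the uncorrected correlation understates the reliability of the full set of ratings; the correction steps it up to full length.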

Factor analytic studies. Factor analysis (principal components 
rotated to an oblique solution using the biquartimin criterion, 
Harman, 1960) of scores on the 42 traits typically yields five 
factors which have stable structure from sample to sample within 
the CBS population and which agree well with results obtained with
other populations the writer has studied (high school and nursing 
students) and with populations studied by others. The five factors 
isolated are similar to those reported by Tupes and Christal (1961) 
and by Norman (1963). The terms we use to designate the five fac- 
tors are “agreeableness,” “extraversion,” “strength of character,” 
“emotionality,” and "refinement." Table 1 shows the loadings ob- 
tained on each of these five factors in each of three large samples: 
583 college students, 521 nursing students, and 324 high school 
students. (These three samples contain 20, 22, and 14 peer groups, 
respectively; most peer groups consisted of 15-30 members.) The 
peer variables are identified by number in Table 1 and by number
and name in Table 2. The bottom of Table 1 shows that the relative
importance of the five factors, measured in terms of variance
accounted for (i.e., per cent of trace), is the same in all three
samples. The factor analytic structure of the 42 peer variables,
shown in Table 1, is psychologically meaningful and very similar for
the three samples.

Prediction of academic success. In both univariate and multivariate
studies of data obtained from the 348 college students, the predictive
validity of the peer variables belonging to the factor called
“strength of character” was found to be superior to that of variables 
belonging to the other four peer factors and superior to that of 
the 30 nonpeer variables [i.e., 13 measures of academic aptitude, 
15 scores from the Edwards Personal Preference Schedule (EPPS) 
(Edwards, 1954), and two measures of high school performance].4

In univariate studies, the peer variables which predicted year one
grade point average best were: “quitting” (—.47), “inquisitive” 
(+.35), “conscientious” (+.32), “prone to daydream” (—.32), “re- 
sponsible” (+.29), “insistently orderly" (+.28), “self-reliant” 
(+.26), “languid” (−.26), “socially mature” (+.25), and “resourceful”
(+.24). Nine of these 10 variables belong to the factor
described as “strength of character.” A composite score for “strength 
of character” (based on all variables belonging to this factor) 
correlated +.43 with the criterion. (See Table 2.)

Among the 13 academic aptitude measures, the STEP-Writing,
Cooperative English Reading Level, Cooperative English Vocab- 
ulary, and SAT-Verbal gave the highest univariate correlations 
with year one grade point average (+.25, +.23, +.22, and +.22, respectively).
High school rank and number of certified units yielded
correlation coefficients with the criterion of +.19 and +.18, respectively.
(See Table 2.)

In multiple regression analysis (Rao, 1952) of the data of the 
348 college students, the contribution to predictive accuracy made 
by the peer data exceeded that of all other sources of data com- 
bined. Prediction of year one grade point average from a battery 
consisting of 42 peer and 30 non-peer variables was performed 
with a computer program in which variables were admitted to 
the prediction battery in a “step-wise” manner in the order of con- 
tribution to predictive accuracy. In the procedure used, the first 
variable to enter the prediction equation was the one (out of 72)
which had the highest correlation with the criterion. The next to 
enter was the one having the highest correlation with the criterion 
after the effect of the first predictor variable was partialled out. At
the next and at each succeeding step the entering variable was the 
one having the highest correlation with the criterion when the influence
of the multivariate prediction battery obtained at the preceding
step of the analysis was partialled out.


4 The 13 measures of academic aptitude were: the verbal and mathematical
subtests of the College Entrance Examination Board Scholastic Aptitude Test,
five subtests of the Differential Aptitudes Test (verbal reasoning, numerical
reasoning, abstract reasoning, mechanical reasoning, and space relations), four
subtests of the Cooperative English Test (vocabulary, reading speed, reading
level, and mechanics of expression), the writing subtest of the Sequential
Tests of Educational Progress Series, and the Watson-Glaser Critical Thinking
Appraisal (see Buros, 1959). The two high school measures were: high school
rank corrected for class size [i.e., 1.00 − (rank/size)], and number of certified
high school units.
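
The step-wise entry rule described in the text can be sketched as follows. This is a minimal illustration on simulated data, not the computer program used in the study: the function names are hypothetical, the partial correlation is computed by residualizing both the criterion and each candidate on the predictors already admitted, and the number of steps is fixed in advance rather than governed by a significance test.

```python
import numpy as np


def _residual(design, v):
    """Part of v not linearly predictable from the columns of design."""
    beta, *_ = np.linalg.lstsq(design, v, rcond=None)
    return v - design @ beta


def forward_stepwise(X, y, n_steps):
    """Admit, at each step, the predictor with the largest absolute
    partial correlation with y given the predictors already admitted."""
    n, p = X.shape
    selected, multiple_r = [], []
    for _ in range(n_steps):
        design = np.column_stack([np.ones(n)] + [X[:, j] for j in selected])
        y_res = _residual(design, y)
        best_j, best_r = None, -1.0
        for j in range(p):
            if j in selected:
                continue
            x_res = _residual(design, X[:, j])
            denom = np.linalg.norm(x_res) * np.linalg.norm(y_res)
            r = abs(x_res @ y_res) / denom if denom > 0 else 0.0
            if r > best_r:
                best_j, best_r = j, r
        selected.append(best_j)
        # Multiple correlation of the enlarged battery with the criterion.
        design = np.column_stack([design, X[:, best_j]])
        fitted = design @ np.linalg.lstsq(design, y, rcond=None)[0]
        multiple_r.append(float(np.corrcoef(fitted, y)[0, 1]))
    return selected, multiple_r


# Simulated data: the criterion depends on predictors 3 and 1 only.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 6))
y = 2.0 * X[:, 3] + 1.0 * X[:, 1] + rng.normal(scale=0.5, size=200)
selected, multiple_r = forward_stepwise(X, y, 3)
print(selected[:2])
```

At the first step nothing has been admitted, so the partial correlation reduces to the ordinary correlation with the criterion, which matches the entry rule stated for the first variable.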

In the results reported in Tables 3 and 4, the analyses were stopped 
after 10 variables5 had entered the prediction battery. At the
tenth step of the analysis reported in Table 3 the multiple correla- 
tion coefficient was .64 and the contribution of each of the 10 vari- 
ables was statistically significant. Six of these 10 were peer vari- 
ables, two were aptitude variables, and two were variables dealing 
with high school performance. The total contributions to the coefficient
of determination (R²) made by the four types of variables in
this analysis were: peer 68%, aptitude 19%, high school 13%, and
EPPS 0%. The peer variable called “quitting” (see Figure 1 for
complete definition) contributed more to predictive accuracy than
any of the other 71 predictor variables in this analysis. It accounted
for 55 per cent of R². As shown in Table 1, “quitting,” which is
peer variable #5, belongs to the factor we call “strength of character.”

A parallel analysis was conducted by substituting five composite6
peer variables for the 42 original peer variables. Under those conditions
the multiple correlation coefficient reached .60 at the tenth
step of the analysis. The total contribution to R² was: peer 59%,
aptitude 18%, high school 16%, and EPPS 6%. The composite peer
variable we call “strength of character” contributed more to predictive
accuracy (it accounted for 52 per cent of R²) than any of
the other 34 predictor variables (see Table 4).
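
The reported shares of R² can be checked approximately from the published coefficients. The sketch below assumes that a variable's contribution is the increase in R² at the step where it enters, so that the first variable to enter contributes the square of its ordinary correlation with the criterion; the source does not spell out this definition, and the function name is hypothetical.

```python
def first_step_share(r_first, multiple_r_final):
    """Share of the final R-squared attributable to the first entering
    variable, taken as its squared correlation with the criterion."""
    return r_first ** 2 / multiple_r_final ** 2


# "Quitting": r = -.47 with the criterion, final multiple R = .64.
quitting_share = first_step_share(-0.47, 0.64)
# Composite "strength of character": r = +.43, final multiple R = .60.
character_share = first_step_share(0.43, 0.60)
print(round(100 * quitting_share), round(100 * character_share))
```

Under that assumption the shares come to about 54 and 51 per cent, close to the reported 55 and 52 per cent; the small differences are what one would expect from rounding of the published correlations.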

Comments. The sample of students for whom results are reported
in Tables 2, 3, and 4 is one of several samples studied at CBS and 
elsewhere with the peer rating method of assessing personality.
Results obtained with this particular sample are of special interest 
because in it the peer data were collected early in the acquaintance 
of the peers (just prior to report of midterm grades for the first 
semester of the freshman year and hence six months before data
comprising the criterion, year one grade point average, were generated)
in an attempt to avoid bias due to rater knowledge of academic
performance of ratees. Both reliability and predictive validity
of the “short acquaintance” peer data were somewhat lower
than when CBS peer data were collected after longer acquaintance 
(e.g., the median reliability of the short acquaintance peer data
was .69, whereas that for CBS samples studied after one year of 
acquaintance is usually between .80 and .85) but the particular 
peer variables which had the highest predictive validity in the present
analyses (namely, those belonging to what we call “strength
of character”; see Tables 1 and 2) are those which other unpublished
analyses have shown to be most highly correlated with academic
success in CBS samples studied after longer acquaintance.

The CBS population is unique in the respects mentioned in the 
last two paragraphs of Method. In addition, the univariate cor- 
relations between the criterion (year one grade point average) 
and typically useful predictors such as high school performance
and academic aptitude scores were somewhat lower than is usually 
found in the literature. Nevertheless, there are reasons for believing 
the relationships found in this study are not exclusively indigenous 
to CBS. In concurrently published studies of students from nursing 
school and high school (Smith, 1968), the results obtained with 
peer data are similar to those obtained in the CBS study. Further 
work is needed to establish the generality of these findings, but it 
should be mentioned that peer rating measures of perseverance, 
conscientiousness, inquisitiveness, responsibility, self-reliance, and 
orderliness found to be related to academic success in the CBS 
population have also been found to be related to academic success 
in analyses of data from other populations that the writer has stud- 
ied. It is of particular interest that all of these traits belong to the 
same factor, called here “strength of character.” 

Summary and conclusions. The relationship between personality 
and academic success, though recognized as important, is not well 
understood. This paper presents results indicating that peer ratings 
of personality (an approach to personality assessment not often 
used by psychologists) can be remarkably helpful in clarifying this 
elusive relationship. Data reported here were contributed by 1,428 
students from college, high school, and nursing school. Results are 
presented concerning the reliability and factor analytic structure
of peer data obtained in all three populations studied, and concerning
the predictive validity of peer and non-peer data obtained
in one sample of 348 college students. In addition, the paper 
discusses the theoretical rationale underlying our use of the peer 
rating method of personality assessment and discusses ways of 
avoiding methodological problems frequently encountered in use of 
rating data. 

Three conclusions are drawn: (1) Peer ratings of personality, 
properly elicited and evaluated, can provide information of high 
reliability and predictive validity. (2) The factor analytic structure 
of the 42 personality variables studied with the peer rating tech- 
nique is highly stable from sample to sample within and across pop- 
ulations. (3) The peer variables belonging to the factor called 
here “strength of character” are important nonintellective correlates
of academic success.
COMPARATIVE VALIDATION OF ATTITUDE
MEASURES BY THE MULTITRAIT-MULTIMETHOD 
MATRIX 


JACK M. HICKS 
University of Missouri at Kansas City1


The main purpose of the present investigation was to compare a
direct with an indirect measure of attitude within the framework of
the Campbell and Fiske (1959) multitrait-multimethod matrix. 
The prototypic adaptation of the multitrait-multimethod matrix for 
this purpose is provided by Maher, Watt, and Campbell (1960). 
In this study, attitudes toward “law and justice” and attitudes to- 
ward “home and family” were each measured by sentence comple- 
tion stems (indirect method), and by the orthodox Likert method 
of summated ratings. However, a crucial limitation of the Maher, et 
al. study, as recognized by the authors, was the absence of a third 
method against which the relative validities of the methods in ques- 
tion could be checked. The present investigation provides such a 
third method in the form of the Semantic Differential (Osgood, Suci,
and Tannenbaum, 1957). Another novel feature of the present 
study was the choice of an objective judgment “task” as the mode 
of indirect attitude assessment. Objective judgments of the favorability
or unfavorability of attitude statements have been receiving
increasing support as a useful indirect technique of attitude measurement
(e.g., Zavalloni and Cook, 1965; Selltiz, Edrich, and
Cook, 1965) since its initial recommendation by Hovland and 
Sherif (1952). The Likert (1932) technique served as the direct 
measure, in which case the only difference between the two meth- 
ods was instructional set. 


1 This study was completed while the author was at Wake Forest College, 
and was supported in part by the Wake Forest College research and publica- 
tion fund. 
The basic hypotheses of the multitrait-multimethod matrix involve
evaluations of relative and absolute magnitudes of given types
of correlations. First, the highest correlations in the matrix are 
expected among monoattitude-monomethod values (reliability coef- 
ficients). Direct comparisons of the reliabilities of direct and in- 
direct measures can, thus, be readily made. The second highest cor- 
relations in the matrix should be the monoattitude-heteromethod 
(convergent validity) coefficients. Validity values corresponding 
to all three methods are expected to be highly significant statistic- 
ally, and substantially higher than corresponding heteroattitude- 
heteromethod values. Of special interest in this context is (1) in- 
spection of the relative convergent validity levels of the direct and 
indirect methods with the Semantic Differential, and (2) a com- 
parison of the respective monoattitude-heteromethod values relative 
to corresponding heteroattitude-heteromethod values. In order to 
compare direct and indirect measures, particular attention should 
be given to their respective heteromethod correlations associated 
with the Semantic Differential. The same holds true for a fourth 
basis for comparison; viz., the relative degree to which respective 
monoattitude correlations with the Semantic Differential exceed the 
relevant heteroattitude-monomethod values. A fifth comparison
can be made by examining the monomethod values for each of the 
methods relative to corresponding heteromethod values. The larger 
the monomethod correlations relative to the heteromethod values, 
the greater the evidence of common-method variance. 
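
The four types of correlation named in this paragraph can be made concrete by enumerating the measures used in the study (three attitudes, each measured by three methods) and classifying every pair. The sketch below is only a bookkeeping aid; the function name and labels are introduced here, and no correlations are computed.

```python
from collections import Counter
from itertools import combinations, product

ATTITUDES = ["Negroes", "Peace Corps", "Journalism"]
METHODS = ["Likert", "Semantic Differential", "objective judgment"]


def classify(measure_a, measure_b):
    """Label a correlation by whether its two measures share an
    attitude and/or a method. Each measure is (attitude, method)."""
    same_attitude = measure_a[0] == measure_b[0]
    same_method = measure_a[1] == measure_b[1]
    if same_attitude and same_method:
        return "monoattitude-monomethod (reliability)"
    if same_attitude:
        return "monoattitude-heteromethod (convergent validity)"
    if same_method:
        return "heteroattitude-monomethod"
    return "heteroattitude-heteromethod"


measures = list(product(ATTITUDES, METHODS))  # nine measures
counts = Counter(classify(a, b) for a, b in combinations(measures, 2))
for label, n in sorted(counts.items()):
    print(label, n)
```

The nine reliabilities lie on the diagonal of the matrix (each measure paired with itself) and so do not appear among the 36 off-diagonal pairs counted here.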

A sixth way of comparing methods is by showing the degree to 
which each is correlated with a measure of social desirability bias.
As a general case, such scores should correlate higher with direct 
test scores than with indirect test scores. Direct tests presumably 
allow for greater manipulation of relevant test responses than do 
indirect tests, and in addition, high social desirability scorers should 
be among those most tempted to manipulate their responses—in a 
socially acceptable direction. Another important consideration 
may be the extent to which the attitude referents (and other aspects 
of the testing situation) generate social desirability incentive. At- 
titudes toward Negroes, the Peace Corps, and Journalism were 
chosen as topics as likely to decrease respectively in this regard. 
Correlations between direct and indirect methods should, therefore,
vary as these attitudes vary, in that the methods should be differentially
susceptible to social desirability bias. Highest validities
were expected when the attitudinal topic was journalism, with 
successively lower correlations for attitudes toward the Peace 
Corps and Negroes respectively. 

Method—Respondents. One hundred and nineteen introductory 
psychology students (81 males, 38 females) at Wake Forest Col- 
lege, participated in partial fulfillment of a course research partici- 
pation requirement. 

Materials. Stimulus materials consisted of a measure of social 
desirability, and measures of attitudes toward Negroes, the Peace 
Corps, and the field of journalism. Social desirability was measured 
by Edwards’ (1957) 39-item scale excerpted from the MMPI. Each 
of the attitudes was measured by three methods: (1) a 6-point 
Likert scale (direct method); (2) the Semantic Differential (SD);
and (3) objective judgments (indirect method). The Likert meas- 
ure of attitudes toward Negroes consisted of 45 statements, 25 of 
which came from Hinkley’s (1932) scale of 114 statements. Items 
were pretested on 55 advanced undergraduate psychology students.
The final form was reduced to 45 items with significant item-total
correlations, and which also yielded 80 per cent agreement
among 10 judges (mostly psychology staff colleagues) making independent
judgments of the directionality of the statements. The
Likert measures of attitude toward the Peace Corps and journalism 
consisted of sets of 30 and 20 items respectively, also treated in 
the above manner, and originating with this investigation. Com- 
parable numbers of favorably and unfavorably worded statements 
were present on the three inventories as a control against acquies- 
cence bias. 

All three attitudes were also measured by a 14 polar-adjective 
Semantic Differential. Eight of the adjectives represented the evalu- 
ative factor, with the remaining six divided equally among the ac- 
tivity and potency dimensions. Semantic Differential forms were 
headed by the key words Negro, Peace Corps, and Journalism. 

The third method of measuring attitudes was by an objective 
judgment technique. This method was identical to the Likert tech- 
nique except for instructions to make objective judgments, as fol- 
lows: 

In the following tasks, you are to report as objectively as possible
your judgments regarding the degree of favorability or unfavorability
of each statement toward certain groups and institutions.
For example, one set of statements pertains to Negroes as a group; 
another set of statements pertains to the Peace Corps, etc. To the 
left of each statement is a space provided for you to numerically 
rate that statement according to the following scale: 


      40                        50                        60
An unfavorable         Neither a favorable          A favorable
  statement            nor an unfavorable            statement
                            statement


Thus, if in your judgment, a statement is “unfavorable” toward 
the group involved, you should place a number in the neighborhood 
of “40” on the line to the left. If you judge a statement to be “favorable”,
then a “60” would be more appropriate. Feel free to use
numbers above, below, and in between those indicated. The above
scale is intended only as a guide. For example, if you have previously
made a rating of “60” and come upon a statement which you
judge to be even more favorable, then a rating somewhat above
“60” should be used. Remember, we want your OBJECTIVE
JUDGMENT about each statement as it stands. 

Basically the same instructions were used successfully for a 
variety of judgmental materials in a previous study (Hicks and 
Campbell, 1965). 

Design. Comprehensive self-administering booklets were pre- 
pared, in which each of the three attitudes was measured by each 
of the three methods. The Edwards Social Desirability Scale was 
also included in each booklet. Attitudes were grouped by method
of measurement. That is, all three attitudes were presented by a 
single method before further consideration of the same attitudes 
was made by a different method. The three methods were rotated as
follows: judgment—SD—Likert; Likert—judgment—SD; and 
SD—Likert—judgment. The Social Desirability Scale was placed 
for each of three methods. Thus, the three attitudinal orders were
matched with each of the three method arrangements, producing
nine (3 × 3) types of booklets. It was not considered necessary
to vary within-task item orders, in that extensive precautions taken
by Hicks and Campbell (1965) showed a complete absence of
item order effects. 

Procedure. Questionnaire booklets were prearranged for dis- 
tribution in sets of the nine booklet types discussed above. Re- 
spondents were tested in aggregates of approximately a dozen per 
session. Introductory front page instructions appearing on all 
booklets were as follows: 

On the pages to follow, you are asked to respond to a variety of 
words and statements according to certain instructions. Proceed 
through the items rapidly, but conscientiously. Refrain from turn- 
ing back to items already answered, or ahead to items not yet an- 
swered. If you have a question, quietly bring it to the attention of 
the monitor.

There are no “right” or “wrong” answers to these items. It is essential
that you give your own genuine reaction to each item as it
stands. Do not attempt to “read into” an item anything not ex- 
plicitly stated. 

The data which you provide will be used for statistical purposes 
only, and are strictly confidential. Proceed. 

Respondents worked in familiar classrooms on nonclass time, 
with student research assistants presiding. Respondents were allowed
to leave the room quietly upon completing the booklet, while
others continued to work. 

Inventory scoring. The usual method of summated ratings was 
used for all three methods. Only the eight evaluative dimension ad- 
jectives were used in the scoring of Semantic Differentials, and 
according to the usual procedure, the graphic scales were divided 
into seven score categories. For objective judgments inventories the 
numerical rating of “50” was chosen as a pivotal point in the conversion
of ratings made for negatively worded statements. Thus, a
rating of “40” for negatively worded statements was converted to
“60” in order to reflect the respondent's attitude. Likewise, 41 was
changed to 59, 42 to 58, 43 to 57 . . ., 55 to 45, 56 to 44 . . ., etc.
Ratings of “50” and ratings for all positively worded statements
were left unchanged.
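
The conversion rule for negatively worded statements amounts to reflecting each rating about the pivot of 50. A minimal sketch follows; the function names and the example data are illustrative assumptions, not part of the original scoring procedure.

```python
PIVOT = 50  # pivotal rating named in the text


def convert_rating(rating, negatively_worded):
    """Reflect a rating about the pivot when the statement is negatively worded.

    Positively worded statements, and ratings of exactly 50, are unchanged.
    """
    if not negatively_worded:
        return rating
    return 2 * PIVOT - rating


def summated_score(ratings, negative_flags):
    """Summated rating: add the converted ratings over all statements."""
    return sum(convert_rating(r, neg) for r, neg in zip(ratings, negative_flags))


# Hypothetical respondent: three statements, the second negatively worded.
print(summated_score([70, 40, 55], [False, True, False]))
```

Under this rule 40 becomes 60, 41 becomes 59, and 55 becomes 45, matching the examples given above.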
Results and discussion. Reliability analysis. Internal consistency
reliability estimates were computed for all measures using the Spear- 
man-Brown correction formula recommended by McNemar (1962, 
p. 150). As shown in Table 1 and Figure 1, resulting split-half coefficients
tended to be high, ranging from .77 to .95. The reliability
of .77 for the Edwards Social Desirability Scale is comparable
to the .83 value reported by Edwards (1957, p. 31). The
three methods of attitude measurement yielded reliabilities ranging 
from .82 to .95, with those involving the topic of journalism tending 
to be the lowest, and Negro reliabilities being generally the highest. 
This result appears directly to reflect differences in total numbers of
statements making up each measure (20, 30, and 45 items for
Journalism, Peace Corps, and Negroes, respectively). That is, the 
more items in the scale, the higher its reliability—a typical finding. 
The SD, in which the same eight evaluative polar-adjectives were 
used for all three attitudes, yielded more homogeneous reliabilities. 
These values, ranging from .86 to .90, are comparable to the test-retest
stability values of .87 to .93 reported by Osgood et al. (1957,
p. 192). In full accord with expectation, the 10 reliability values 
are the 10 highest correlations in the matrix. These values are suf- 
ficiently high, in fact, that corrections for attenuation produce only 
very minor increases in the remaining values in the matrix. A com- 
parison of internal consistencies for direct and indirect methods re- 
flects favorably upon both, with only a slight difference in ranges 
of these values being noted (.82 to .95 for Likert and .84 to .92 
for Judgments). 
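
For readers who want the arithmetic behind these statements, the sketch below gives the standard general Spearman-Brown formula and the usual correction for attenuation. It assumes that the formula cited from McNemar is the standard one and that added items are parallel to the existing ones; the function names are illustrative.

```python
def spearman_brown(r, factor=2.0):
    """Predicted reliability when a test is lengthened by `factor`.

    With factor=2 this is the usual split-half step-up: 2r / (1 + r).
    """
    return factor * r / (1.0 + (factor - 1.0) * r)


def correct_for_attenuation(r_xy, r_xx, r_yy):
    """Estimate the correlation between true scores from an observed
    correlation r_xy and the two reliabilities r_xx and r_yy."""
    return r_xy / (r_xx * r_yy) ** 0.5


# Half-test correlation of .50 stepped up to full length.
print(round(spearman_brown(0.50), 3))
# A 20-item scale with reliability .82 lengthened to 45 comparable items.
print(round(spearman_brown(0.82, 45 / 20), 3))
```

Under these assumptions a 20-item scale with reliability .82 would be predicted to reach about .91 at 45 items, which is in line with the observation that the longer scales were the more reliable. When both reliabilities are near .90, dividing by their geometric mean changes an observed correlation only slightly, which is why the corrections for attenuation are described as minor.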

Convergent validity analysis. Basic hypotheses concerning monoattitude
heteromethod coefficients are strongly supported. As indicated
in Figure 1, eight of these nine values reach a very high
level of statistical significance (p<.001), and, with the exception 
of judgments-SD values, tend to be higher than validity coefficients 
typically reported (differential dispersions not accounted for). 

A comparison of the two independent sets of monoattitude cor- 
relations with the SD, show the Likert values to 
higher than objective jud, 
equal, the Likert techni 
evaluative attitudinal responses. Thus, correlations between judg-
mental and direct measures should decrease to the extent that ob- 
jective judgments actually prevail, overriding attitudinal responses. 
This effect evidently operated to some degree not only by virtue of 
the relatively low convergent validity correlations between judg- 
mental and SD scores, but also by the fact that the judgments-Likert 
validities are markedly lower than the corresponding sets of re- 
liability coefficients for these two instruments. However, substantial 
influence of attitude upon objective judgment—a necessity for judg- 
ments to serve as indices of attitude—is evidenced by the fact that 
judgments scores correlate with SD scores as high as they do, and 
also by the even higher monoattitude correlations between judg- 
ments and Likert scores. 

Comparison of convergent and discriminant validities. In order 
more adequately to evaluate the above convergent validities, and 
other criteria of test adequacy, it is necessary to consider the rela- 
tive levels of corresponding discriminant validities. A basic de- 
sideratum of the multitrait-multimethod matrix is that convergent 
validity values must exceed the heterotrait-heteromethod values 
lying in their respective columns and rows of the same hetero- 
method block. This requirement is convincingly met in all except 
one instance (value of .06 for journalism). Even here the hetero- 
attitude values are uniformly negligible, and exceed the monoat- 
titude value of .06 only slightly in two of the four instances, one 
of which is negative. 

A further and somewhat more demanding requirement of the mul- 
titrait-multimethod matrix is that each convergent validity exceed 
those heterotrait-monomethod values with which it has one trait 
and one method in common. Again, the record of the convergent 
validity values is impressive with uniformity being violated in only 
five of the 36 possible comparisons. Thus, a substantially greater 
amount of common-trait than common-method variance is indi- 
cated. 
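
The first of these requirements can be checked mechanically within one heteromethod block. The sketch below is only an illustration: the 3 × 3 block is hypothetical (apart from echoing a .06 validity, its values are invented and are not those of Figure 1), and the function name is an assumption.

```python
def convergent_exceeds_block(block):
    """For each trait i, report whether the validity value block[i][i] is
    larger than every other value in row i and column i of the same
    heteromethod block."""
    n = len(block)
    results = []
    for i in range(n):
        row_others = [block[i][j] for j in range(n) if j != i]
        col_others = [block[j][i] for j in range(n) if j != i]
        results.append(all(block[i][i] > v for v in row_others + col_others))
    return results


# Hypothetical block: rows are traits by method 1, columns traits by method 2.
hypothetical_block = [
    [0.06, 0.08, -0.07],
    [0.02, 0.40, 0.05],
    [0.01, 0.03, 0.35],
]
print(convergent_exceeds_block(hypothetical_block))
```

In this invented block the first trait fails the requirement because one off-diagonal value (.08) exceeds its validity of .06, while the other two traits pass. The second requirement can be checked the same way by comparing each validity with the relevant heterotrait-monomethod values.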

A comparison of Likert and judgments measures shows that the 
judgments method is involved in all five instances in which mono- 
method exceed monoattitude values. The one instance involving 
the Likert method also involves the judgments method. Thus, again, 
the direct method has generally the better record. 

Discriminant validity comparison. Though the evidence strongly 
indicates greater common-trait than common-method variance, there
is nonetheless some evidence of the latter to consider. Comparison 
of heteroattitude-monomethod to heteroattitude-heteromethod val- 
ues shows a tendency for the former to be higher. As indicated in 
Figure 1, five of the nine monomethod values are significant 
(p < .05), whereas only two of the 18 heteromethod values reach a
similar level. 

A comparison of the Likert and judgments methods again shows 
the direct method to have the edge. 

Social desirability analysis. Hypotheses based upon the influence 
of social desirability bias were uniformly unsupported. The predic- 
tion that correlations between direct and indirect methods would 
decrease as pressure for social desirability bias increased, did not 
materialize. Nor does it appear that there is any tendency for the 
attitudinal referents to elicit differing amounts of social desirability 
bias, as indicated by their uniformly negligible correlations with 
social desirability scores. In this regard, it seems plausible that the 
testing situation itself was perceived by the respondents as suf- 
ficiently innocuous as to provide no basis for motivation to distort 
responses in a socially desirable direction. On the other hand, sus- 
picion can be cast upon the validity of the Edwards SD scale. 

Conclusion. The basic correlational data of this investigation 
appear very adequate, with all intrinsic multitrait-multimethod 
matrix hypotheses being readily confirmed. Reliabilities of the 
measures were all very high, and evidence of high convergent and 
discriminant validity was present. Various comparisons of the 
Likert and judgments measures showed the former to be generally 
superior. Though reliabilities were comparable, the Likert method 
yielded slightly less methods variance and considerably higher con- 
vergent validities, both in terms of absolute magnitudes and rela- 
tive to discriminant values.

Thus, it is evident that under conditions of the present experi- 
ment, the direct method was superior to the indirect method. How- 
ever, a comprehensive comparison may not have obtained, because 
of either the failure of social desirability pressure to materialize— 
a condition for which indirect measures are especially intended— 
or, failure of the Edwards SD scale in reflecting relevant individual
differences. Furthermore, the objective judgments method may serve
as a “weakened” index of attitude even when social desirability incentive
is present, in that judgments instructions are likely at least
partially to inhibit attitudinal responses. The extent to which this is
the case may depend upon the degree of ego-involvement of the 
respondents (Ward, 1965), the judgment method (e.g., Kelley,
Hovland, Schwartz, and Abelson, 1955), and perhaps the “thor- 
oughness” of the objective judgments instructions. The objective 
judgments method may be advantageous, however, whenever the 
relative social desirability incentive intrinsic to the situation ex- 
ceeds the relative inhibiting effects of judgments instructions. 
A COMPARISON OF TEST PERFORMANCE ON THREE
ANSWER SHEET FORMATS¹


PRISCILLA HAYWARD
Rutgers, The State University²


This study was designed to investigate possible differences in test
performance as a consequence of the kind of answer sheet used by
the student. Because an increasing variety of machine-scorable 
answer sheet formats are now supplied for use with multiple-choice 
tests, the relevance of norms based on a standardization using an 
older answer sheet format has been questioned. 

Previous investigations have compared performance on the older 
IBM 805 format to that on some other format. Studies involving 
the IBM 805 and IBM 1230 answer sheets (Miller and Minor, 1963; 
Miller, 1965; Dizney, Merrifield, and Davis, 1966) have shown 
that no differences in mean scores on unspeeded tests exist above 
the eighth grade level but that they may exist at the fourth grade 
level, that the color of the answer sheet makes no difference, that
females tend to score higher than males on either type of answer
sheet, and that subjective feelings of college students and adults
indicate a preference for the IBM 805 answer sheet over the IBM 
1230. 

Mean scores of sixth graders on the IBM 805 answer sheet and 
the IBM mark-sense answer card showed no differences in a study 
by Slater (1964). Merwin (1963) compared Differential Aptitude 
Test scores of ninth grade students on IBM 805 and MRC answer 
sheets and found differences between the formats on the speeded 
Clerical Speed and Accuracy subtest but not on the unspeeded
subtests. The MRC Clerical results for males were significantly
lower than the IBM 805 results.

1 This article is based upon a thesis submitted to the Graduate School of
Education of Rutgers, The State University, in partial fulfillment of the
requirements for the degree of Master of Education.
2 Now at New York State Education Department.

The present study employed three answer sheets simultaneously
(the IBM 805, IBM 1230, and the Digitek) and took a further look 
at the fourth grade level, as well as the eighth grade level, and at 
possible sex differences. An unspeeded reading test, which had
been standardized on the IBM 805 format, was used. 

Subjects. The sample consisted of 125 fourth graders and 134 
eighth graders from a public school system in central New Jersey.
Each group comprised the entire grade level at two schools. Both 
groups had previous experience with the MRC format, and the 
eighth grade group had also used the IBM 805 format one year 
earlier.

Measuring instruments. The Reading Test of the Sequential Tests
of Educational Progress (STEP) was used with IBM 805, IBM
1230, and Digitek answer sheets in the present study. Form 4A 
was administered to the fourth grade group, and Form 3A was 
used with the eighth grade group. 

In order to adjust for differences in verbal ability between girls 
and boys, appropriate levels of the Academic Ability Test (AAT) 
of the Ohio Survey Tests were also administered. SCRIBE answer 
sheets were used. 

Procedures. At each grade level the sample was stratified by sex
and by homeroom, and within each stratification the students were 
randomly divided into thirds to receive either an IBM 805 answer 
sheet, an IBM 1230 answer sheet, or a Digitek answer sheet for the 
STEP Reading Test. 
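
The assignment procedure just described can be sketched as follows. This is an illustration under stated assumptions, not the procedure actually used: the record layout, the function name, and the use of a pseudo-random generator are all hypothetical, and the random starting offset is one possible way of handling strata whose size is not a multiple of three.

```python
import random
from collections import defaultdict

FORMATS = ("IBM 805", "IBM 1230", "Digitek")


def assign_formats(students, seed=0):
    """Randomly split each sex-by-homeroom stratum into thirds.

    `students` is an iterable of (student_id, sex, homeroom) tuples.
    Returns a dict mapping student_id to an answer sheet format.
    """
    rng = random.Random(seed)
    strata = defaultdict(list)
    for student_id, sex, homeroom in students:
        strata[(sex, homeroom)].append(student_id)

    assignment = {}
    for key in sorted(strata):
        members = strata[key]
        rng.shuffle(members)
        # Random starting format, so leftover students in a stratum do not
        # always receive the same format.
        offset = rng.randrange(len(FORMATS))
        for position, student_id in enumerate(members):
            assignment[student_id] = FORMATS[(position + offset) % len(FORMATS)]
    return assignment


# Hypothetical roster: two sexes, two homerooms, nine students per stratum.
roster = [(f"{sex}{room}{i}", sex, room)
          for sex in "MF" for room in ("A", "B") for i in range(9)]
sheets = assign_formats(roster, seed=1)
```

With nine students in every stratum, each format is given to exactly three of them.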

The fourth grade sample took the STEP and AAT tests one week 
apart, whereas the eighth grade sample took both tests on a single 
day. All testing took place in January, 1966. 

Mean STEP raw scores were computed for each answer sheet 
format and sex within each grade level. Differences among the re- 
sulting means, after being adjusted for any differences in verbal 
ability as measured by the AAT Verbal subtest, were studied by 
means of a two-way analysis of covariance (Steel and Torrie, 1960, 
pp. 321-322). Pairs of means within each grade sample were also
compared by t tests (Ferguson, 1959, p. 238).

Results. Although 125 and 134 students in grades four and
eight, respectively, participated in the study, data were incomplete
for some subjects. Additional cases were discarded randomly because
of statistical considerations. Hence, the results reported here
are based on 108 fourth graders and 126 eighth graders.

TABLE 1
Adjusted STEP Reading Means, Grade 4

                Answer Sheets
Sex       IBM 805   IBM 1230   Digitek
Male        42.02      45.24     50.06
Female      46.31      45.33     45.87

The lack of speededness for the STEP Reading Test was verified, 
for, on the average, the fourth grade group completed 96.3 per cent 
of the test, and the eighth grade group completed 98.7 per cent. 

The adjusted STEP Reading mean raw scores for grade 4 are 
presented in Table 1. The analysis of covariance is summarized in 
Table 2. 

The F ratio for sex, which is not significant, indicates that no sex 
differences existed on the average at the fourth grade level, when 
the influence of verbal ability had been removed. The F ratio for 
answer sheet format, however, was significant at the .05 level, 
indicating that the type of answer sheet format used did affect the 
scores. Furthermore, the large F ratio for Sex × Answer Sheet interaction,
significant beyond the .01 level, indicates that the type
of answer sheet format affected the scores in different ways for the 
different sexes. The answer sheets are ranked from easiest (highest 
mean) to most difficult (lowest mean) in different ways. For fourth 
grade boys the IBM 805 answer sheet was the most difficult and the 
Digitek easiest, with the IBM 1230 of medium difficulty; for the 
girls, the three formats were of almost equal difficulty. Although
limitations exist when t tests are applied in multiple comparisons, it
was thought nevertheless that some indication of possible significance
in differences between pairs of means might be informative,
despite the risk of type I errors in excess of .05. When applied,
the t tests showed that the difference between the male IBM 1230
and IBM 805 means is not significant, but the male Digitek mean is
significantly higher (easier) than either of the two IBM means
(p<.001). Comparisons of means between the sexes showed significant
differences at the .01 level between their performance on
the IBM 805 and Digitek answer sheets.

TABLE 2
Analysis of Covariance of Grade 4 STEP Reading Scores

Source of Variance     df   Adjusted MS          F
Sex (S)                 1         29.48       1.71
Answer Sheet (AS)       2         73.07       4.28*
S × AS                  2      4,802.86     282.40**
Error                 101         17.22
Total                 106

TABLE 3
Adjusted STEP Reading Means, Grade 8

                Answer Sheets
Sex       IBM 805   IBM 1230   Digitek
Male        42.17      42.17     41.38
Female      45.82      44.16     45.97

Adjusted STEP Reading mean raw scores for grade eight are 
presented in Table 3. The analysis of covariance is summarized 
in Table 4. 

The F ratio for sex demonstrates that there was a large difference,
significant beyond the .01 level, attributable to sex at the
eighth grade level, with the girls scoring significantly higher than
the boys. The F ratio for answer sheets is significant at the .05 level
and indicates that the type of answer sheet format used affected the
scores. The F ratio for Sex × Answer Sheet interaction, significant
beyond the .01 level, shows that the type of answer sheet format
affected the scores in different ways for the two sexes. Ranking from
most to least difficult are the Digitek, IBM 805, and IBM 1230
for boys; for the girls the order is IBM 1230, IBM 805, and Digitek.
The t tests indicate that the rankings for the boys do not represent
significant differences, but for the girls the rankings represent
differences between the IBM 1230 and IBM 805 (p<.05)
and between the IBM 1230 and Digitek (p<.01) but not between
the IBM 805 and the Digitek answer sheets. The t tests comparing
any of the female means with any of the male means reach significance
beyond the .01 level, with the performance of the girls higher
than that of the boys in all cases.

TABLE 4
Analysis of Covariance of Grade 8 STEP Reading Scores

Source of Variance     df   Adjusted MS        F
Sex (S)                 1        453.04    94.19**
Answer Sheet (AS)       2         20.33     4.23*
S × AS                  2         32.12     6.68**
Error                 119          4.81
Total                 124
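
As a check on how the F ratios in Table 4 arise, each one is the adjusted mean square for an effect divided by the adjusted error mean square. The short sketch below recomputes them from the tabled values; the variable names are illustrative.

```python
# Adjusted mean squares as printed in Table 4 (grade 8).
ERROR_MS = 4.81
ADJUSTED_MS = {
    "Sex": 453.04,
    "Answer Sheet": 20.33,
    "Sex x Answer Sheet": 32.12,
}


def f_ratio(effect_ms, error_ms):
    """F ratio for an effect: its mean square over the error mean square."""
    return effect_ms / error_ms


f_values = {source: round(f_ratio(ms, ERROR_MS), 2)
            for source, ms in ADJUSTED_MS.items()}
for source, f in f_values.items():
    print(f"{source}: F = {f}")
```

The recomputed values (94.19, 4.23, and 6.68) agree with those printed in the table.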

Discussion. It appears that answer sheet format makes a dif- 
ference in the score fourth and eighth grade students receive on an 
unspeeded multiple-choice achievement test. It is not clear, how- 
ever, as to what sort of difference it makes. There is no predictable 
direction that pertains to both sexes or that prevails from grade
four to eight. In the present study, the sex of the student at the 
fourth grade level made no difference, but the interaction of sex 
and the type of answer sheet was significant. The boys reacted in 
one way, and the girls in another. For the girls it made no practical 
difference which answer sheet they received. For the boys, on the 
other hand, it made a significant difference; they found it easiest 
to use the Digitek sheet and the most difficult to use the IBM 805 
sheet. This suggests, perhaps, that girls by the time they reach grade 
four have mature motor coordination, whereas the boys do not. 
These findings may indicate that the full potential for grade four 
boys may not be tapped unless the high density Digitek answer 
sheet is used. 

Since the principals and teachers at the participating schools in- 
dicated that they expected the IBM 805 open format with items 
arranged in vertical columns to be easier to handle than the denser 
format with horizontal rows of items typical of the IBM 1230 and 
Digitek formats, it is interesting to find that the fourth grade boys 
performed in the opposite direction. In the study by Dizney et al. 
(1966), similar attitudes prevailed, where a preference for the IBM 
805 format over the IBM 1230 format was found. 

At the eighth grade level where there was a sex difference, the 
girls obtained higher scores than the boys when differences in 
verbal ability were ruled out. Here it may be possible to say that the 
boys and girls were equally mature in motor coordination and at- 
tribute the difference to clerical accuracy, an ability often associated
with the feminine gender.

Clerical speed, it is believed, can be eliminated as a contribut- 
ing factor. At both grade levels most of the students completed the 
STEP Reading Test before the time deadline. The average student 
finished 96 to 98 per cent of the test. Thus the findings of this 
study may be relevant to unspeeded tests in general. They should 
not, however, have much bearing on speeded tests. In the Merwin 
(1963) study, it will be remembered, the MRC answer sheet (which 
has a dense format like the IBM 1230 and Digitek sheets) pro- 
duced lower results than the IBM 805 answer sheet on a speeded 
test of clerical abilities.

From the test users’ standpoint, it might be advisable to con- 
sider using with unspeeded tests a high density answer sheet, such 
as Digitek, for boys in the lower grades. For girls, one answer sheet 
should work as well as another. 

From the test publishers’ standpoint, there is an apparent need 
to provide normative data applicable to each new answer sheet for- 
mat as it appears and to provide information on the speededness of 
tests. In some cases, the norms for one answer sheet may be ap- 
plicable to results on another answer sheet or require only minor 
adjustments. 

The STEP Reading Tests, which were standardized on IBM 805 
answer sheets, may need to be restandardized with the newer high 
density sheets. It seems that the existing norms are probably over- 
estimates (i.e., too high) when applied to high density answer
sheet results from both sexes at the fourth grade level and under-
estimates when applied to girls’ results on the IBM 1230 answer 
sheet at the eighth grade level. The range of means from 44.16 to 
45.97 for eighth grade girls, however, represents a significant 
rather than practical difference.

In future studies, there are a number of variables still to be in- 
vestigated. It is suggested that samples of different ability levels 
be used; the present study dealt with high ability students who may 
have been challenged to do well on the various answer sheets and 
who were therefore highly motivated. The role of practice could also 
be explored when comparing answer sheet formats. And finally, 
measures of motor coordination and visual-motor perception might 
be added variables to consider, for there are indications that visu- 
ally handicapped students, for example, are less able to handle
separate answer sheets. 

Summary. The STEP Reading Test, Form 4A, was administered
to a sample of fourth graders (N = 108), and Form 3A was administered
to a sample of eighth graders (N = 126). Each group was
subdivided by sex, with a random third of each subdivision using an 
IBM 805 answer sheet, another third using an IBM 1230 answer 
sheet, and the remaining third using a Digitek answer sheet. Each 
group also took the Academic Ability Test on SCRIBE answer 
sheets, from which the Verbal subscore was used to adjust the STEP 
Reading mean scores for the influence of verbal ability. 

The F ratios revealed that significant differences in the adjusted 
mean scores on the STEP Reading Test were attributable to an- 
swer sheet format at the .05 level at both grade levels tested and 
to sex at the .01 level at grade eight. A significant interaction of sex
and answer sheet format, however, was observed, indicating that 
boys and girls reacted in different ways to the three formats. 

It was concluded that answer sheet format should be taken into 
consideration when interpreting test results from currently avail- 
able normative data. 

At the fourth grade level where no overall sex difference was 
found, the girls performed similarly on all three answer sheets, but 
the boys found the Digitek sheet the easiest and the IBM 805 
sheet the most difficult, contrary to the expectations of the princi- 
pals and teachers at the schools involved. 

At the eighth grade level the girls scored higher than boys on all 
formats and found the IBM 1230 answer sheet most difficult. For 
the boys all answer sheets were about equally difficult. 

It appears that girls may have greater motor coordination and/or
clerical accuracy, because in this study they received higher scores 
than the boys at the eighth grade level and encountered no difficulty 
on any type of answer sheet at the fourth grade level. 
EFFECT OF PERCEIVED SCORING FORMULA ON
SOME ASPECTS OF TEST PERFORMANCE¹


L. K. WATERS
Ohio University 


The instructions given to examinees who are about to take an
objective type test may encourage them to guess if they are not sure 
of an answer, or may direct them not to guess, and may set up scor- 
ing penalties intended to enforce the desired behavior. For exam- 
ple, examinees may be told not to guess—that the number of wrong 
answers, or some multiple thereof, will be subtracted from the 
number of right answers they obtain. While the test performance 
of examinees under instructions designed to encourage or discour- 
age guessing has been studied (Jackson, 1955; Keislar, 1953; 
Swineford and Miller, 1953), there is little empirical information 
on the differential effects on test taking behavior of informing ex- 
aminees that one or another penalty formula is to be used in scor- 
ing. The purpose of the present study was to examine the effects 
on test performance of six different instructions on scoring. 

Procedure. The responses of 123 flight students in the indoc- 
trination week of pre-flight training were used to obtain p-values on 
a pool of vocabulary items. Each item consisted of a stem word and
five alternative words, from which respondents indicated the one 
most nearly opposite in meaning to the stem word. 

From this item pool, two 50-item tests (Forms A and B) were 
constructed by matching the p-values (±.02) of items. The tests
had rectilinear distributions of item difficulties ranging from .08 
to .98. Both tests were labeled as a “Word Knowledge Test.” 

1 This study was conducted at the Naval Aerospace Medical Institute,
Pensacola, Florida. Opinions or conclusions contained in this report are those
of the author. They are not to be construed as necessarily reflecting the view
or the endorsement of the Navy Department.

Form A was administered without time limit to 420 flight students
in the early part of the indoctrination week of pre-flight training. In-
structions for marking responses to the items were printed on the 
test booklet, but no information was given regarding how the test 
was to be scored. 

Form B was administered to the same sample of flight students 
near the end of the same week. The general instructions for re- 
sponding to the items were the same as for Form A. However, one 
of six different sets of scoring instructions was printed on each 
Form B test booklet. These six sets of instructions were as fol- 
lows: 


1. Scoring method not specified. 

2. Examinees told each right answer counted one point for them 
and a wrong answer did not count anything against them. 

3. through 6. Examinees told each right answer counted one point 
for them and each wrong answer counted 1/4, 1, 2, or 4 points
against them. Also examinees were told that omitted items 
would not count either way. 
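
The scoring rules in instruction sets 2 through 6 all have the form "rights minus a weight times wrongs," with omitted items contributing nothing. The sketch below writes that rule out and also computes the expected change in score from a blind guess on a five-alternative item; the function names are illustrative, and the assumption that a blind guess is equally likely to fall on any alternative is added here, not taken from the study.

```python
from fractions import Fraction

# Weights for wrong answers under instruction sets 2 through 6.
PENALTIES = [Fraction(0), Fraction(1, 4), Fraction(1), Fraction(2), Fraction(4)]


def formula_score(rights, wrongs, penalty):
    """Score = rights - penalty * wrongs; omitted items do not count."""
    return rights - penalty * wrongs


def expected_gain_from_blind_guess(penalty, n_alternatives=5):
    """Expected score change from guessing at random on one item, assuming
    every alternative is equally likely to be chosen."""
    p_right = Fraction(1, n_alternatives)
    return p_right - (1 - p_right) * penalty


for w in PENALTIES:
    print(f"penalty {w}: expected gain {expected_gain_from_blind_guess(w)}")
```

With five alternatives per item, a penalty of 1/4 point makes blind guessing neutral on average; smaller penalties reward it and larger ones punish it.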


After the examinees had finished Form B, all regular marking 
pencils were collected and red pencils were distributed. The ex- 
aminees were then told to attempt to answer all of the items they 
had omitted. This procedure provided a means of obtaining per- 
formance data on both originally attempted and originally omitted 
items. 

Each of the six scoring instructions was used on approximately 
one-sixth of each class tested. This resulted in six groups (one for 
each scoring instruction), ranging from 61 to 79 cases each. 

Results and discussion. Table 1 shows mean scores for Form A
(R_A), the initially attempted items on Form B (R_B), and all the
items on Form B (R_BT). The differences between R_A and R_BT were
not statistically significant, although there was a consistent difference
of about one item between the mean numbers of right answers
to the two forms. On R_B, however, the numbers of right
answers for all groups that were told that wrong answers would be
penalized differed significantly (P < .05) from both R_A and R_BT
means. Further, it was established by use of Duncan’s Multiple
Range Test (1955) that mean R_B scores for groups with unspecified
or zero weights for wrongs differed significantly from mean R_B scores
for all groups with specified weights greater than zero. Mean number
of rights for groups with specified weights greater than zero did not
differ significantly from each other. The proportion of right responses
(rights/attempts) increased consistently, but not significantly,
with increases in scoring weights.

TABLE 1
Mean Rights for Six Groups on Two Forms of a Vocabulary Test

                        Groups with weight for wrongs
                        1        2        3        4        5        6
                  unspec.        0     -1/4       -1       -2       -4
R_A (Form A)        26.79    26.23    26.37    26.45    25.95    27.55
R_B (Form B)        25.20    24.97    22.15    20.46    20.46    20.84
R_BT (Form B)       25.59    25.33    25.37    25.26    24.96    26.22
R_B/Attempts          .53      .51      .59      .62      .64      .68

Table 2 compares the mean numbers of items omitted on Form
A (O_A) with those omitted on Form B (O_B), and the differences
between these means as the penalty for wrong responses increased.
The groups were roughly comparable in terms of number of omitted
items on Form A (F < 1.0), but differed substantially on Form B as
a function of the specific scoring sets (F = 34.73; df = 5,414). For
each increment in the specified scoring weight for wrong responses,
there was a significant (P < .05) increase in the mean number of
omits, except for the difference between the unspecified and zero
conditions. Apparently, examinees regarded no specific scoring instructions
in much the same manner as specified zero weights for
wrongs.

TABLE 2
Mean Omits for Six Groups on Two Forms of a Vocabulary Test

                        Groups with weight for wrongs
                        1        2        3        4        5        6
                  unspec.        0     -1/4       -1       -2       -4
O_A (Form A)         1.23     1.33     1.44     1.32     1.37      .43
O_B (Form B)         2.05     1.23    12.31    16.88    18.56    19.36
O_B - O_A             .82     -.10    10.87    15.56    17.19    18.93


For groups with scoring weights greater than zero, the more able
examinees (higher R_A scores) tended to omit fewer items on Form
B (r's ranged from -.33 to -.42) and to get somewhat more of the
originally omitted items correct when required to answer (r's ranged
from .18 to .26). There was no consistent pattern of relationships
with variations in scoring weights. For either unspecified or zero
weight groups, essentially no relationship was found between R_A
scores and number right of the originally omitted Form B items
(r's = .01 and .09). Form A rights and Form B omits correlated
-.24 and -.01 for these groups, respectively.

Examinee test performance in terms of omitted items was af- 
fected in a consistent manner by variations in the penalty they 
were told would be applied to wrong responses. The more able ex- 
aminees, as might have been expected, omitted fewer items but 
still got more of the originally omitted items correct when required 
to answer. 

Table 3 gives the correlations between R_A scores and (a) rights
on attempted Form B items (R_B), (b) total rights on Form B
(R_BT), and (c) total rights on Form B when random responses
were assigned to omitted items (R_BR). The random responses were
assigned by use of a table of random numbers. After the responses
had been assigned to each omitted item, these were scored and the
number right added to the R_B score.
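
The random-response score can be simulated along the following lines. This sketch substitutes a pseudo-random number generator for the table of random numbers used in the study, and the data layout (alternatives coded 0 to 4, omits coded as None) and function name are assumptions made for illustration.

```python
import random


def score_with_random_fill(responses, key, n_alternatives=5, seed=0):
    """Return (rights on attempted items, rights after random fill).

    `responses` holds the chosen alternative (0 .. n_alternatives - 1) for
    each item, or None where the item was omitted. Each omitted item is
    given a randomly drawn alternative, scored, and added to the
    attempted-items rights score.
    """
    rng = random.Random(seed)
    rights_attempted = 0
    rights_from_fill = 0
    for response, correct in zip(responses, key):
        if response is None:
            if rng.randrange(n_alternatives) == correct:
                rights_from_fill += 1
        elif response == correct:
            rights_attempted += 1
    return rights_attempted, rights_attempted + rights_from_fill


# Hypothetical eight-item key and one examinee's responses (three omits).
key = [0, 1, 2, 3, 4, 0, 1, 2]
responses = [0, 1, None, 3, 0, None, None, 2]
r_b, r_br = score_with_random_fill(responses, key, seed=3)
```

The filled score can never fall below the attempted-items score and can exceed it by at most the number of omitted items.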

The differences in intertest correlations involving the three Form
B rights scores were generally very small, in large part due to the
fact that R_B was a large component of both R_BR and R_BT. However,
it was of interest to compare rights scores under the various
scoring sets and rights scores when examinees had been required to
answer every item under both actual “informed” guessing and
simulated “random” guessing conditions.

TABLE 3
Intertest Correlations for Six Groups with Three Types of Rights Scores on Form B

The two groups without any specified penalty for guessing, who
omitted less than 4 per cent of the Form B items, showed no consistently
different relationships among the three sets of Form B
rights scores. However, the four groups with a specified penalty for
wrongs, who omitted from 25 per cent to 40 per cent of the Form B
items, demonstrated higher intertest correlations for R_BI than for
R_B scores. Also, in terms of the intertest correlations, informed
guessing was superior to "random" guessing. In this study, requiring
examinees to answer every item led to higher intertest coefficients,
but for examinees who guessed at random it would have
been better not to force them to respond to every item. The first
three rows of Table 4 present the intertest correlations between
R_A and three types of Form B scores (R_B, the Form B score obtained
by using the scoring formula appropriate to the specified category, and
the “best” Form B score). It is obvious that the scoring formula
appropriate to the set given examinees on Form B yielded progressively
smaller coefficients compared to rights only as the penalty
for wrongs given examinees increased. Examinees did not adapt
their performance on Form B in relation to the scoring set given
them. It can be seen that R - 1/4 W was the best of the scoring
formulas used for all groups with specific scoring instructions, except
for the most extreme group. In general, the scoring formula
appropriate to the structure of the test was better than the scoring
formula appropriate to the scoring set. While the number of items
attempted decreased as the weights for wrongs increased, examinees
with the more severe penalties did not omit enough items.
Whether examinees overestimated their probability of success on
several items or did not perceive the severity of the penalty cannot
be determined from these data.
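A scoring formula of the form R - cW, such as the R - 1/4 W formula named above, can be written as a short function. The sketch below is illustrative only: the function name and the counts of rights and wrongs are hypothetical, and omitted items are assumed to receive no credit and no penalty.

```python
from fractions import Fraction

def formula_score(rights, wrongs, penalty=Fraction(1, 4)):
    """Score = rights minus a fractional penalty for each wrong response.

    Omitted items are neither rewarded nor penalized (an assumption).
    penalty=1/4 corresponds to the R - 1/4 W formula; penalty=0 is rights only.
    """
    return rights - penalty * wrongs

# Hypothetical examinee with 30 rights and 8 wrongs.
rights_only = formula_score(30, 8, penalty=0)
corrected = formula_score(30, 8)
print(rights_only, corrected)
```

Exact fractions are used so that quarter-point penalties do not pick up floating-point rounding.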
CREATIVITY AND MEDIATED ASSOCIATION: A
CONSTRUCT VALIDATION STUDY OF THE RAT¹


JERRY HIGGINS AND LISE-LOTTE LEHD DOLBY
University of California, Santa Barbara 


THE Remote Associates Test of creativity (RAT) (Mednick, 1967)
was constructed in accordance with the postulate that creativity 
consists of the ability “to form associative elements into new com- 
binations by providing mediating connective links” (Mednick, 1962, 
p. 226). It follows that individuals obtaining high scores on the 
RAT should perform in a superior fashion on a learning task involv- 
ing mediated association. The present study was designed to test 
this hypothesis. 

Method—Material. Form I of the RAT was employed. A self-administered,
paper-and-pencil test, the RAT consists of 30 items.
Each item consists of three stimulus words which the testee is asked 
to relate to each other by means of a single common association. A 
fictitious RAT item might be: 

surprise line birthday 
The answer is “party.” The testee’s RAT score is the number of 
correct answers. Reviews of the development and current status 
of the RAT may be found in the examiner’s manual (Mednick and 
Mednick, 1967), or in an earlier, but more detailed, report by 
Mednick and Mednick (1965). 

The learning task consisted of a paired-associate list containing
10 pairs, of which five pairs were mediated by a single associative
link (A-C pairs), and the remaining five pairs were nonmediated
(A-X pairs).² An example of an A-C pair is “fruit-red,” since
“apple” serves as a mediating association, i.e., the stimulus “fruit”
elicits the association “apple,” which in turn elicits the response
“red” (A → B → C). Thus, association on the basis of such mediators
as “apple” would facilitate learning of such mediated, or A-C,
pairs as “fruit-red.” In contrast, the pair “fruit-hand” would be an
example of an A-X pair. While the stimulus “fruit” would continue
to elicit the association “apple,” the latter would not tend to
elicit the response “hand” (A → B ↛ X), i.e., there would be no
strong mediating association. Consequently, mediated association
would not facilitate, and might even interfere with, learning of such
nonmediated, or A-X, pairs as “fruit-hand.”

1 This research was supported by a Faculty Research Grant and a Summer
Faculty Fellowship to the senior author from the University of California.
The authors thank Sarnoff A. Mednick and Martha T. Mednick of the University
of Michigan and Stanley W. Osgood of the Houghton Mifflin Company
for their kind permission to use the RAT prior to its publication.

Two forms of the paired-associate list were used. Both forms 
contained the same 10 different stimulus members, and both con- 
tained five A-C pairs and five A-X pairs. The stimulus members of 
the A-C pairs on Form 1 were the stimulus members of the A-X 
pairs on Form 2 (e.g., “fruit-red” and “fruit-hand,” respectively), 
while the stimulus members of the A-X pairs on Form 1 were the 
stimulus members of the A-C pairs on Form 2. 

Procedure. Eighty-three female introductory psychology students
served as Ss. Alternate Ss were presented with Form 1 or Form 2 of 
the paired-associate list on a memory drum. The list was pre- 
sented at a 4-second rate in three different random orders (al- 
ways beginning with the same order). Learning was taken to a cri- 
terion of two successive perfect trials. 

Following completion of paired-associate learning, each S was
interviewed concerning the strategy she used in learning the list,
and was requested to give specific examples. Fully 41 per cent, or
34, of the Ss were aware of the nature of the paired-associate list
as evidenced by the explicit verbalization of one or more A → B → C
sequences during the post-learning interview. That such an
unusually large proportion of Ss was aware of the mediating associations
is probably accounted for by the relative transparency
of the sequences owing to their shortness (only one mediating association).
In any event, since this knowledge would tend to severely
attenuate the effect of creativity (i.e., an aware S would,
regardless of level of creativity, “know what to look for”), only
the results of the 49 unaware Ss were usable.

2 This list was drawn from the “B-D” and “B-X” pairs constructed by
Russell and Storms (1955, p. 290). Justification for classification of the pairs
as mediated or nonmediated was determined by associative norms.

Finally, each S received the RAT. A 30-minute time limit, of 
which she was informed, was imposed. 

It was predicted that performance on the RAT would be posi- 
tively correlated with learning of A-C pairs relative to learning of 
A-X pairs. 

Results and discussion. The Ss achieved a mean RAT score of 
18.67, with a standard deviation of 4.95 and a range from 3 to 24. 
RAT scores were not associated with form of list (r_pb = .035).
However, a disproportionate number of Ss (36) received Form 1.
Thus, it was possible for creativity to interact with the response
members of Form 1 independently of mediation and confound the
results, but there is no a priori reason to expect such an interaction.

Three indices of paired-associate learning were employed: (a) 
number of trials to criterion, (b) number of errors (overt and co- 
vert) to criterion, (c) number of correct anticipations on the first 
trial. The means and standard deviations, respectively, for the Ss on 
these three indices for the total list were: 2.41 and 1.38 for trials; 
5.49 and 4.25 for errors; 6.76 and 1.79 for correct first-trial an- 
ticipations. In calculating the relationship between RAT score and 
learning of A-C pairs, learning of A-X pairs was held constant 
through partial correlation. The relationship between RAT score
and number of trials to criterion for A-C pairs was in the predicted
direction and statistically significant (r = -.277, df = 46, p < .03,
one-tailed). The relationship between RAT score and number of
errors to criterion for A-C pairs was also in the predicted direction
and significant (r = -.293, df = 46, p < .025, one-tailed). The
relationship between RAT score and number of correct anticipations
on the first trial for A-C pairs was again in the predicted direction
and significant (r = .282, df = 46, p < .03, one-tailed). It may be
noted that the range of RAT talent of the present sample was restricted
at the upper levels, and the ceiling on the learning task was
quite low, indicating that the obtained correlation coefficients are
conservative. If the correlation coefficients are corrected for restriction
in range of RAT scores alone on the basis of a sample of 579
female UCLA undergraduates (Mednick and Mednick, 1967), the
coefficients of correlation between RAT score and learning of A-C
pairs rise in magnitude to -.295 for trials, -.312 for errors, and
.301 for correct first-trial anticipations.

3 Thus, for example, the aware Ss learned the A-C pairs in fewer trials
relative to the A-X pairs than did the unaware Ss (t = 2.79, df = 81,
p < .01, two-tailed).

All three learning measures, i.e., trials, errors, and correct responses
on the initial trial, provided support for the hypothesis of
a positive relationship between RAT score and performance on a
mediated-association learning task. It was therefore concluded that
creativity as measured by the RAT involves mediated association.

Summary. The RAT was administered to 49 female undergradu- 
ates following learning of a paired-associate list which contained 
mediated and nonmediated pairs. As predicted on the basis of the 
theoretical rationale underlying the test, RAT scores were positively
related to learning of the mediated pairs relative to the non-
mediated pairs for trials, errors, and correct anticipations on the 
first trial. 
DIVERGENT AND CONVERGENT ABILITIES OF
SEMANTIC CONTENT AS RELATED TO SOME
PERSONALITY TRAITS OF COLLEGE STUDENTS¹


GEORGE WINDHOLZ
Teachers College, Columbia University²


A RECENT review of research dealing with “creativity” (Golann, 
1963) shows a variety of interests and approaches. One line of 
research has as its aim the determination of the relation of “cre- 
ativity” and “intelligence” to personality traits. 

In an early study, Guilford and associates (1957) correlated a 
number of tests measuring convergent and divergent abilities with 
measures of temperament and motivation. Subsequently, a num- 
ber of studies investigated the relationship of personality traits to 
constellations of divergent and convergent abilities (Barron, 1957; 
Wallach and Kogan, 1965). 

The present study was designed to determine the relationship 
between some personality traits obtained with self-descriptive and 
objectively scored inventories, and the constellations of divergent 
and convergent abilities, semantic in content, as defined within 
Guilford’s model of the Structure of Intellect (Guilford and Hoepf- 
ner, 1963). To be more specific, the aim was to determine the nature 
of the relationship of twenty-six traits of temperament, interest, 
and value to the composite of six convergent and six divergent tests 
and to the difference between six pooled divergent test scores and six 
pooled convergent test scores. 


1 This paper is based on a dissertation submitted to Teachers College,
Columbia University in partial fulfillment of the requirements for the Ph.D. 
degree. The author wishes to thank R. L. Thorndike, chairman of his com- 
mittee, Elizabeth Hagen, and A. T. Jersild for their advice and criticism. 

2 Now at The University of North Carolina at Charlotte. 
Method—Subjects. The subjects were 222 male and female college
undergraduates enrolled in courses of introductory psychology at 
The University of North Carolina at Charlotte. 

Psychological measures. The selection of the intellectual abilities 
was guided by Guilford’s Structure of Intellect (Guilford and 
Hoepfner, 1963) model. Each ability was represented by a factor 
that enters into the model. Each factor included in the Structure 
of Intellect is defined in terms of three dimensions: operation, 
product, and content. Abilities used in the present study were all 
of semantic content but of two operations: divergent and convergent 
productions. Each operation consisted of six products. Conse- 
quently, convergent abilities were measured by six convergent tests 
of semantic content, and the divergent abilities were measured by 
six divergent tests of semantic content. The twenty-six personality
traits used in the present study were obtained through self-descriptive,
objectively scored inventories.

Procedure. Administration of the twelve divergent and convergent
tests took place during two separate sessions. The order of administration
was as follows:

First Session:

Factor DMU Test: Ideational Fluency I
Factor NMU Test: Picture-Group Naming
Factor DMC Test: Alternate Uses, Form A
Factor NMC Test: Word Grouping
Factor DMR Test: Associational Fluency I
Factor NMR Test: Inventive Opposites

Second Session:

Factor DMS Test: Expressional Fluency
Factor NMS Test: Picture Arrangement
Factor DMT Test: Plot Titles (clever)
Factor NMT Test: Gestalt Transformation
Factor DMI Test: Possible Jobs
Factor NMI Test: Sequential Association


In the next two sessions the following inventories were administered:
The Guilford-Zimmerman Temperament Survey
(Guilford and Zimmerman, 1949), the Kuder Preference Record
(Vocational) (Kuder, 1960), and the Study of Values (Allport,
Vernon, and Lindzey, 1960).

Each test of intellectual abilities was scored by a single scorer.
The raw scores of each test were subjected to nonlinear transformation
and normalization (Gulliksen, 1950, pp. 276-282).
For each raw score the corresponding z-score was found and transformed
into a T-score. The T-scores of each subject were pooled,
each with an equal weight, into the following two scores: a Total
Divergent score, obtained by pooling the six divergent T-scores for
each subject, and a Total Convergent score, obtained by pooling
the six convergent T-scores for each subject.
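The normalization step can be sketched as an area transformation: each raw score is converted to a midpoint percentile rank, the percentile rank to a normal deviate z, and z to T = 50 + 10z. This is one common way of carrying out such a normalization; whether it matches the cited procedure in every detail is an assumption, and the raw scores in the example are hypothetical.

```python
from statistics import NormalDist

def normalized_t_scores(raw_scores):
    """Return normalized T-scores (mean 50, SD 10) for a list of raw scores."""
    n = len(raw_scores)
    std_normal = NormalDist()  # standard normal distribution
    t_by_value = {}
    for value in set(raw_scores):
        below = sum(1 for s in raw_scores if s < value)
        equal = sum(1 for s in raw_scores if s == value)
        # Midpoint percentile rank, always strictly between 0 and 1.
        proportion = (below + 0.5 * equal) / n
        z = std_normal.inv_cdf(proportion)
        t_by_value[value] = 50 + 10 * z
    return [t_by_value[s] for s in raw_scores]

# Hypothetical raw scores on one test; tied raw scores receive the same T-score.
raw = [3, 7, 7, 9, 12, 20]
t_scores = normalized_t_scores(raw)
print([round(t, 1) for t in t_scores])
```

Because the transformation depends only on rank order, a skewed raw-score distribution (such as the single high score in the example) is pulled toward a normal shape before the tests are pooled.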

Using these total scores, the following two ability constellations
were obtained: a Composite score, which consisted of the sum of
the Total Divergent score and the Total Convergent score, and the
Predominance score, obtained by subtracting the Total Convergent
score from the Total Divergent score for each subject. The following
definition of the Predominance score was given: whenever a
subject's Total Divergent score was higher than his Total Convergent
score, the difference magnitude was defined as high Predominance;
whenever the subject's Total Convergent score was
higher than his Total Divergent score, the difference magnitude
was defined as low Predominance. Whenever the subject's Total
Divergent score was equal to his Total Convergent score, the resulting
lack of difference was defined as zero Predominance. To
eliminate negative values a constant of 100 was added to each
Predominance score.
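The two constellation scores defined above can be computed as in the following sketch. It assumes that "pooling with equal weight" means simple summation of the six T-scores, which is consistent with a total-group Composite mean near 600 in Table 5; the function name and the example T-scores are hypothetical.

```python
def ability_constellations(divergent_t, convergent_t, constant=100):
    """Return (Composite, Predominance) from six divergent and six convergent T-scores."""
    total_divergent = sum(divergent_t)    # Total Divergent score
    total_convergent = sum(convergent_t)  # Total Convergent score
    composite = total_divergent + total_convergent
    # The constant of 100 removes negative values, so 100 marks zero Predominance.
    predominance = total_divergent - total_convergent + constant
    return composite, predominance

# Hypothetical subject who is somewhat stronger on the divergent tests.
divergent = [55, 60, 52, 58, 49, 56]
convergent = [50, 47, 53, 51, 48, 50]
composite, predominance = ability_constellations(divergent, convergent)
print(composite, predominance)
```

A Predominance score above 100 therefore indicates higher divergent than convergent standing, and a score below 100 the reverse.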

To determine the relationship between personality traits and 
the constellations of intellectual abilities, the Predominance and 
Composite scores were correlated (Pearson's r) with the twenty- 
six scores of temperament, interest, and value traits. 

Results. The reliability of the Predominance score, as computed
by the Mosier formula (Guilford, 1954, pp. 393-394), was .679,
and the reliability of the Composite score, as computed by the Mosier
(1943) formula, was .815.
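The reliability of a weighted composite, in the form usually attributed to Mosier, is one minus the ratio of summed weighted error variance to composite variance. The sketch below applies it to a sum (weights +1, +1) and to a difference (weights +1, -1) of two components; the component standard deviations, reliabilities, and intercorrelation are hypothetical, so the output does not reproduce the .679 and .815 reported above.

```python
def mosier_reliability(weights, sds, reliabilities, corr):
    """Reliability of a weighted composite of components.

    corr is the full component intercorrelation matrix with 1.0 on the diagonal.
    """
    k = len(weights)
    # Weighted error variance: sum of w_j^2 * sd_j^2 * (1 - r_jj).
    error_var = sum(
        (weights[j] * sds[j]) ** 2 * (1 - reliabilities[j]) for j in range(k)
    )
    # Composite variance: sum over all i, j of w_i * w_j * sd_i * sd_j * r_ij.
    composite_var = sum(
        weights[i] * weights[j] * sds[i] * sds[j] * corr[i][j]
        for i in range(k)
        for j in range(k)
    )
    return 1 - error_var / composite_var

# Hypothetical totals: equal SDs, reliabilities of .80, intercorrelation of .50.
sds = [1.0, 1.0]
rels = [0.80, 0.80]
corr = [[1.0, 0.5], [0.5, 1.0]]
sum_reliability = mosier_reliability([1, 1], sds, rels, corr)
difference_reliability = mosier_reliability([1, -1], sds, rels, corr)
print(round(sum_reliability, 3), round(difference_reliability, 3))
```

The example shows why a difference score such as Predominance tends to be less reliable than a sum such as Composite when the two components are positively correlated: the error variance is the same, but the variance of the difference is smaller.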

Table 1 shows the main findings of the present study; namely,
the correlation coefficients of the temperament, interest, and value
trait scores to the Predominance and Composite scores. Most of
these correlation coefficients were not significant. However, some
correlations as high as .247 were found, still low but significant
at the .01 level in a nondirectional test. Any interpretation of results
must take into consideration the very low degree of relationship
between intellectual ability constellations and personality
traits. It should also be noted that the descriptive terms used in
the following paragraphs are those used by the inventories' authors
to characterize the traits on which significant correlations were
found.

TABLE 1
Correlations between Twenty-Six Personality Trait Scores with Predominance
and Composite of Divergent and Convergent Ability Scores, Semantic in
Content (N = 222)

                                          Score
Trait                           Predominance   Composite

Guilford-Zimmerman Temperament Survey
 1. General Activity                .102          .208*
 2. Restraint                      -.008         -.093
 3. Ascendance                      .036          .189*
 4. Sociability                     .045          .089
 5. Emotional Stability            -.173*         .129
 6. Objectivity                    -.182*         .076
 7. Friendliness                   -.061         -.015
 8. Thoughtfulness                  02 *          .036
 9. Personal Relation              -.070          .042
10. Masculinity                    -.169         -.027

Kuder Preference Record (Vocational)
11. Outdoor                        -.084         -.004
12. Mechanical                     -.160         -.168
13. Computational                  -.158         -.053
14. Scientific                     -.079         -.035
15. Persuasive                     -.006          .008
16. Artistic                        .063          .023
17. Literary                        .131          .188*
18. Musical                        -.001          .178*
19. Social Service                  .136          .082
20. Clerical                       -.055         -.116

Study of Values
21. Theoretical                    -.049          .049
22. Economic                       -.132         -.173*
23. Aesthetic                       .067          .247*
24. Social                          .141          .010
25. Political                      -.037         -.016
26. Religious                       .010         -.119

*Significant at p = .01 level, nondirectional test.

Table 2 shows the correlation of two temperament trait scores and
of each of those two traits with the Predominance score. The relationship
indicated that subjects who were higher in divergent
than convergent ability, semantic in content, tended to describe
themselves as showing frequent mood, interest and energy fluctuations,
and a tendency to be more worried, to have more frequent
daydreams, and to be more hypersensitive, suspicious, and self-centered
than subjects who were of higher convergent than divergent
ability.

TABLE 2
The Relation of the Predominance Score to Certain Personality Trait Scores
and Their Intercorrelation (N = 222)

Variable                      5          6

   Predominance            -.173*     -.182*
5. Emotional Stability                 .671*
6. Objectivity

*Significant at p = .01 level, nondirectional test.

As Table 3 shows, six personality trait scores and the Composite 
score form a cluster. Subjects who were high on the Composite 
dimension tended to describe themselves as energetic, efficient, vital, 
persuasive, outspoken, more inclined to leadership, more inter- 
ested in reading literary works and participating in musical activi- 
ties, placing more value on form, harmony, and the aesthetic ex- 
perience, while placing less value on the practical and useful 
aspects of daily life. 

TABLE 3
The Relation of the Composite Score to Certain Personality Trait Scores and
Their Intercorrelation (N = 222)

Variable                   1       3       17      18      23      22

    Composite            .208*   .180*   .188*   .178*   .247*  -.173*
 1. General Activity             .270*   .018   -.092   -.098    .019
 3. Ascendance                           .090    .043   -.019    .042
17. Literary                                     .163    .207*  -.300*
18. Musical                                              ат     -.341*
23. Aesthetic                                                   -.548*
22. Economic

*Significant at p = .01 level, nondirectional test.

Since two personality trait scores were found to be correlated with
the Predominance score and six different personality trait scores
were found to be correlated with the Composite score (see Table
1), the correlations of the Predominance score and its related personality
trait scores and of the Composite score and its related
personality trait scores were established. Table 4 shows that out of
the possible twenty-one correlation coefficients only four were found
to be significant at the .01 level, in a nondirectional
test, indicating two largely independent intellectual ability-personality
trait clusters. It follows that subjects can be described in terms
of dimensions that are largely independent, namely the Predominance-personality
dimension and the Composite-personality dimension.
When these dimensions are respectively split at their medians,
each subject's personality trait score can be assigned to one of
the following groups: HH, High on Predominance and high on
Composite dimensions; HL, High on Predominance and low
on Composite dimensions; LH, Low on Predominance and high on
Composite dimensions; LL, Low on Predominance and low on
Composite dimensions. Each of these constellations is characterized
by certain personality trait scores and intellectual dimension
scores, expressed in terms of means and standard deviations
in Table 5. Since the deviations of the subgroup personality trait
score means from the personality trait score means for the total
group are rather small, the interpretation of personality trait characteristics
of each subgroup must be taken as indicative or suggestive
rather than definitive.
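The median-split grouping described above can be sketched as follows. The text does not say how scores falling exactly at a median were handled; assigning them to the low group is an assumption made here, and the scores in the example are hypothetical.

```python
from statistics import median

def median_split_groups(predominance, composite):
    """Label each subject HH, HL, LH, or LL (Predominance first, Composite second)."""
    p_median = median(predominance)
    c_median = median(composite)
    labels = []
    for p, c in zip(predominance, composite):
        # Scores exactly at the median are treated as low (an assumption).
        p_label = "H" if p > p_median else "L"
        c_label = "H" if c > c_median else "L"
        labels.append(p_label + c_label)
    return labels

# Hypothetical Predominance and Composite scores for four subjects.
predominance_scores = [90, 110, 95, 120]
composite_scores = [560, 640, 650, 550]
print(median_split_groups(predominance_scores, composite_scores))
```

Subgroup means and standard deviations of the kind reported in Table 5 would then be computed separately within each of the four labels.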

TABLE 4
Correlations between the Composite-Personality Variables and Predominance-
Personality Variables (N = 222)

                                        Emotional
Variables            Predominance       Stability      Objectivity

Composite                .202*
General Activity         .102             eus             .088
Ascendance               .036             .216*           .220*
Literary                 .131            -.010            .107
Musical                 -.001            -.004           -.041
Aesthetic                .067            -.115           -.015
Economic                -.132             .084            .051

*Significant at p = .01 level, nondirectional test.

TABLE 5
Means and Standard Deviations for Personality Trait Scores and for Intellectual
Ability Constellation Scores for Four Subgroups of Subjects Obtained by
Splitting the Predominance and Composite Score Distributions at
Their Respective Medians and the Means of the Variables for the
Total Group (N = 222)

                                       High Predominance
                    Total      Low Composite         High Composite
                    Group               Standard              Standard
Variables           Mean      Mean     Deviation    Mean     Deviation

General Activ.      47.36     47.09      15.03      48.83      10.20
Ascendance          46.84     44.12      10.81      50.67       9.22
Emot. Stability     47.99     45.68      10.63      48.08       9.51
Objectivity         48.38     46.76       9.59      48.67       8.18
Literary            54.99     54.07      30.23      60.48      26.68
Musical             56.79     50.92      10.09      59.05      27.07
Economic            40.72     42.08       8.66      38.55       9.43
Aesthetic           38.89     37.49      11.00      40.17       8.77
Predominance       100.20    127.02      21.12     131.57      22.69
Composite          598.41    556.06      30.08     650.82      38.03

                                 N = 51                N = 60

                                       Low Predominance
                    Total      Low Composite         High Composite
                    Group               Standard              Standard
Variables           Mean      Mean     Deviation    Mean     Deviation

General Activ.      47.36     44.07       9.35      48.82      12.12
Ascendance          46.84     44.83      10.77      47.45       8.89
Emot. Stability     47.99     47.41      10.00      50.88       9.16
Objectivity         48.38     48.42       9.95      49.61       8.86
Literary            54.99     48.37      29.20      56.65      32.91
Musical             56.79     57.13      30.21      59.63      29.48
Economic            40.72     41.92       8.77      40.54       8.77
Aesthetic           38.89     37.05       9.64      40.98       8.72
Predominance       100.20     69.02      23.92      73.39      19.44
Composite          598.41    550.43      34.96     641.49      34.31

                                 N = 60                N = 51

Subjects whose mean scores were either high or low on the Predominance
dimension and high on the Composite dimension tended
to describe themselves on the average as active, ascendant,
emotionally stable, not hostile or hypersensitive, but interested in
literary and musical activities, and valuing the aesthetic experience
but not the pragmatic aspects of daily life.

Subjects whose mean scores were high on the Predominance
dimension and low on the Composite dimension tended to describe
themselves on the average as inactive, submissive, emotionally
unstable, hostile and hypersensitive, not interested in musical or
literary activities, and valuing less the aesthetic experience and
more the pragmatic aspects of daily life.

Subjects whose mean scores were low on both the Predominance
and the Composite dimension tended to describe themselves on the
average as emotionally unstable, not hostile or hypersensitive,
but inactive, submissive, not interested in literary but interested
in musical activities, not valuing the aesthetic experience but
valuing the pragmatic aspects of daily life.

Discussion. Perhaps the most notable finding of the present study 
is the negative relationship of the Predominance score to the Emo- 
tional Stability and Objectivity trait scores of the Guilford-Zim- 
merman Temperament Survey. The Predominance score is basic- 
ally a difference index and it was found that subjects who scored 
higher on divergent than convergent abilities, semantic in content, 
described themselves in terms that Guilford and Zimmerman 
(1949, p. 9) declare to be indicative of neurotic tendencies. 

However, the fourfold classification of subjects as seen in Table 
5 does not show that subjects who were high on the Predominance 
dimension, regardless of whether they were high or low on the 
Composite dimension, described themselves in terms of emotional 
instability and hypersensitivity. Apparently, the correlation among
... results. Still, subjects who were high on the
Predominance dimension and high on the Composite dimension
described themselves in terms of a slightly greater tendency to emotional
instability and hypersensitivity than subjects who were low
on the Predominance dimension and high on the Composite dimension.

Despite the somewhat confounded findings, it appears that a fu- 
ture investigation of the role of discrepancy of divergent and con- 
vergent abilities to neurotic tendencies would be in order. 

Summary. This study was designed to determine the relationship 
of twenty-six traits of temperament, interest and value to two con- 
stellations (Predominance and Composite) of divergent and con- 
vergent abilities of semantic content, as defined within Guilford’s 
model of the Structure of Intellect. Subjects whose mean scores
were either high or low on the Predominance dimension and high
on the Composite dimension tended to describe themselves on the
average as active, ascendant, emotionally stable, showing preference
for the aesthetic experience but not for the pragmatic experience.
Roughly the converse self-description was obtained from subjects
who were either high or low on the Predominance dimension and
low on the Composite dimension.
IMPRESSION FORMATION AS A MEASURE OF THE
COMPLEXITY OF CONCEPTUAL STRUCTURE¹
THE interest of researchers in complexity theory (differentiation
and integration of conceptual dimensions) has increased greatly
over the last few years. Many points of view have been presented
(Barron, 1953; Berkowitz, 1957; Bieri, 1955; Driver and Streufert,
1964, 1966; Ehart, 1957; Harvey, Hunt and Schroder, 1961;
Higgins, 1959; Leventhal, 1957; Lundy and Berkowitz, 1957;
Mayo and Crockett, 1964; Plotnick, 1961; Schroder, Driver and
Streufert, 1967; Scott, 1962, 1963; Sechrest and Jackson, 1961;
Ware, 1958; Witkin, Dyk, Faterson, Goodenough, and Karp, 1962;
Zajonc, 1960). Whenever predictions from complexity theory
have been tested, a number of differentiation or integration measures
have been used which sometimes seem to have little in common.
A factor analytic study of Vannoy (1965) has borne out the
conclusions of Scott (1962, 1963) that there are few communalities
in the measures of complexity. There are at least three possible
interpretations of such a finding. First, one might assume that most
or all of the measures do not measure differentiation and integration
characteristics correctly. However, the fact that predictions of
many researchers with these tests have had significant results tends
to argue against that interpretation. Second, one might conclude
that human information processing as expressed in the complexity
of differentiation and integration of cognitive dimensions is by no
means a unitary phenomenon. A third conclusion, no matter
whether complexity is seen as unitary or multifaceted, could be
that differentiation and integration may vary as a function of content
area and as a function of situational properties. An inspection
of the different measures of complexity that have been proposed indicates
that they do emphasize complexity in differing content
areas. Subjects used by experimenters predicting from complexity
differences have come from a variety of (uncontrolled) environmental
conditions where they experienced different levels of environmental
exposure. There is probably little question that different
environmental pre-conditions to testing, and differences in
the content area in which the complexity of information processing
is tapped, would result in considerable differences in complexity of
performance.

1 Support by the Office of Naval Research, Group Psychology Branch, is

A look at the work of Vannoy (1965), however, would suggest 
that our third conclusion may be quite valid; yet it would hardly 
account for his results in their entirety. In addition, the work of 
Streufert and Schroder (1965) and Streufert and Driver (1965) 
indicates that it is unlikely that overly stressing (overloading) en- 
vironments would produce one kind of complex information process- 
ing, while optimal environments would produce another, and less 
optimal environments yet another (cf. Vannoy's factors). It is 
without question that there must be concern with pre-test, pre- 
experimental environments and with content area differences. 
However, the first concern necessarily is with the components of 
complexity, and with the identification of tests which are specifically 
designed to measure these components. 

Components of complexity. (1) Differentiation and Integration. 
Earlier theories of complexity (e.g. Witkin et al, 1962) were pri- 
marily concerned with differentiation and sometimes integration. 
In some (Harvey et al, 1961), differentiation is seen as a necessary 
but insufficient antecedent to integration, while some of the very 
recent approaches (cf. Schroder, Driver and Streufert, 1967) con- 
ceive of the two as possibly more independent information process- 
ing characteristics, which are correlated, but not necessarily pre- 
conditions for each other. 

(2) Perceptual and executive subsystems. In a recent amplification
of the complexity model, Driver and Streufert (1964, 1966)
have postulated that one must consider at least two subsystems of
human information processing, each of which would handle differentiation
and integration in characteristic degrees. The perceptual
subsystem is largely concerned with data search and intake.
The executive subsystem, on the other hand, would utilize the more
or less differentiated and integrated concepts generated by the
perceptual subsystem to make decisions (produce behavioral output),
which would again be more or less differentiated and integrated.

(3) Social and nonsocial areas. Finally, one more distinction may
be made which is relevant to the particular environment which
produces the input and upon which the behavioral output will take
effect. Here we can again distinguish between at least two different
forms of complexity: social complexity (concerned with interpersonal
perception and interaction), and nonsocial complexity (concerned
with behavior in environments where perceptions and decisions
have no interpersonal relevance).

Measurement. At least three different ways exist in which
complexity may vary: (1) from simplicity (no differentiation, no
integration) to complexity (both differentiation and integration) by
way of differentiation and no integration, or integration but no
differentiation; (2) in terms of its perceptual or executive
function; and (3) in its social vs. nonsocial characteristics.
Further characteristics may eventually need to be added: it may at
some point become necessary to establish distinctions between
complexity as a style, a preference, an ability, and a cognitive
motive (cf. Driver and Streufert, 1964, 1966), and to distinguish an
inflexible hierarchical integration from a flexible and adaptable
integration. If no such additions are needed, we would require at
least four different tests (or test components) to measure various
degrees of differentiation and integration: (1) a measure of
perceptual social complexity, (2) a measure of executive social
complexity, (3) a measure of perceptual nonsocial complexity, and
(4) a measure of executive nonsocial complexity.
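
The four measures listed above are the cells of a two-by-two
crossing (subsystem by area). The short Python sketch below only
makes that crossing explicit; the names SUBSYSTEMS, AREAS, and
required_measures are illustrative assumptions introduced here and
do not come from the complexity literature.

```python
from itertools import product

# The two distinctions named in the text (labels are illustrative).
SUBSYSTEMS = ("perceptual", "executive")
AREAS = ("social", "nonsocial")


def required_measures():
    """Return one measure name per (area, subsystem) cell.

    Areas vary slowest so that the order matches the enumeration
    (1) to (4) given in the text.
    """
    return [
        f"{subsystem} {area} complexity"
        for area, subsystem in product(AREAS, SUBSYSTEMS)
    ]


if __name__ == "__main__":
    for number, name in enumerate(required_measures(), start=1):
        print(f"({number}) a measure of {name}")
```

Adding a further distinction (for example style vs. ability) would
multiply the number of cells, which is why the text speaks of "at
least" four tests.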

A good deal of recent research on differentiation and integration
has utilized the sentence completion test (sometimes also called the
paragraph completion test) (Schroder and Streufert, 1963; Schroder,
Driver and Streufert, 1967) as the predictor of individual
differences in perception or performance (e.g., Driver, 1965; Sieber
and Lanzetta, 1964; Stager, 1966; Streufert, 1966; Streufert and
Driver, 1965; Streufert and Schroder, 1965; Streufert, Suedfeld
and Driver, 1965; Suedfeld and Streufert, 1966; Tuckman, 1964).
Although this measure has been repeatedly validated, it has inherent
problems which make it less useful for the general researcher in
psychology and education. (1) The test can be scored only by a few
qualified researchers after lengthy training. Some students never
reach the desired interrater reliability. (2) Test-retest
reliability of the test varies considerably with samples and scorers
(cf. Reed, 1966). The absence of alternate forms makes
experimentation with changes in complexity after training a rather
difficult task. Further, this test may largely reflect remembered or
already stored integrations. It does not require S to integrate
information in his current style. Consequently, the sentence
completion test may be out of date or, worse, confounded by social
desirability (i.e., parroting of rote-learned complex responses).
(3) Finally, and probably most important, the test probably measures
a composite complexity which does not permit an analysis of the
different forms of complexity that have been discussed above. Much
less does the test assist in the discovery of more, or more precise,
components of complexity.

In this paper, an alternative is presented to the sentence comple- 
tion test, which was specifically designed to measure one kind of 
complexity only: perceptual social complexity. This measure is 
called the Impression Formation Test. In its original form it was 
first proposed by Streufert and Schroder (1963), but has since 
undergone considerable alteration in both administration and anal- 
ysis. 

The Impression Formation Test. Like the sentence completion
test, the impression formation test is subjective in nature.³
Impression formation derives from the work of Asch (1946) on primacy
and recency. Luchins (1948) and Gollin (1951) indicated that
individual differences in responses to impression formation tests
are noted in the way subjects form impressions from the
contradictory information provided in the Asch task. For instance,
subjects who are asked to describe a person who is “intelligent,
industrious and impulsive,” then another person who is “critical,
stubborn and envious,” and who are then asked to describe a person
to whom all adjectives apply, sometimes make use of all adjectives
for the last description and sometimes utilize only some of them.
Gollin stated that in his research some subjects formed impressions
which integrated and related the conflicting themes, others formed
impressions which were an unintegrated aggregate of independent
themes, and still a third group of subjects formed impressions which
contained only the nonconflicting parts of the information. On the
basis of this information, Streufert and Schroder (1965) developed
the first version of the impression formation test as a measure of
an integration-nonintegration dichotomy. This test correlated with
the sentence completion test, r = +.78, in a sample of 86 college
freshmen. Further correlations between the two tests, which have
been obtained since, have varied from +.26 to +.88 with a median
correlation of +.52. In its more recent form, the impression
formation test may be used with the following adjectives:

(1) intelligent-industrious-impulsive / critical-stubborn-envious.
(2) nervous-obstinate-jealous / reliable-sociable-independent.
(3) attractive-cooperative-aware / self centered-aggressive-impolite.
(4) distinguished-educated-outgoing / tyrannical-careless-forgetful.
(5) polite-friendly-superior / vain-sarcastic-destructive.

³ Work on objective multiple component measures of complexity is in
progress.


In all cases the subject is given a specified time (the length varies 
from 1-1/2 to 3 minutes, depending on the educational level of the 
subject) to describe a person to whom the first set of three adjec- 
tives would apply (e.g. nervous, obstinate, jealous). He then has 
to repeat the task with the second set of adjectives (e.g., reliable, 
sociable, independent). Finally he is asked to describe a person 
to whom all six adjectives would apply. (Here the subject is given 
one minute more than in the first two responses to make the de- 
scription.) Scoring of the response for complexity relies primarily 
on the last (combined adjectives) description. 

If more than one set of adjectives is used, or if test-retest
reliability is to be established, only the combined set of
adjectives need be used, provided the subject has once been
presented with the full sequence of three adjectives, three
adjectives, and six adjectives. Parallel form and test-retest
reliabilities for the adjective combinations are presented in
Table 1.

Scoring of the Impression Formation Test. (a) Scoring for
integrative complexity. Scores given to responses of subjects are
determined from the degree to which these responses show
differentiation and integration. Scores for integrative complexity
vary from score 1 (low complexity) to score 5 (high complexity). The
writers will discuss the basis for these scores in some detail
below. Examples given will be drawn from responses of subjects to
the first set of adjectives.

Score 1_i. The response shows inability to deal with conflicting
information. The conflict is negated, rather than resolved, usually
by omission of the implications of incongruent adjectives. There is
no evidence of differentiation or integration.

Example: “A person who is critical and stubborn is probably a
low paid clerk in a government position. He has not been advanced 
recently, and is therefore jealous and has a great distaste for all 
who have done better than he has. He hates his work. He mistreats 
all who work under him. He could not be intelligent.” 

TABLE 1

Intercorrelations and Test-Retest Reliabilities for Five Sets of
Impression Formation Adjectives*

                 Adjective Set                   Sentence
                                                 Completion
           1      2      3      4      5         Test

*Correlations in the diagonal represent test-retest reliabilities
obtained through administration of all items five weeks apart.
Correlations to the left of the diagonal are integration scores.
Correlations to the right of the diagonal are differentiation
scores. All correlations are positive.

Score 2_i. The response shows an inability to apply conflicting
information to a single setting or a single point in time. Some
evidence of differentiation is present, but there is no evidence of
integration. However, some attempt is made to deal with more than
one kind of information.

Example: This is a character who is an excellent colleague at work. 
His intelligence and hard work have helped the business. His quick 
decision making is often an advantage. However, when he comes 
home to his family, he becomes intolerable and takes all his troubles 
out on them. 

Score 3_i. The response indicates an ability to bring at least one
of the incongruent adjectives into the description. Differentiation
may or may not be relevant; some integration is present.

Example 1: It sounds like my uncle J. K. He is very intelligent,
by all means, and he uses his intelligence to be critical, terribly
much so as a matter of fact, to defend his view. Even if he is wrong,
he is stubborn and righteous. (This example indicates little
differentiation.)

Example 2: Teachers are often that way. If you meet them in 
the classroom only, they seem to have all the good qualities. They 
are bright, work hard, and quick thinkers. But when they are to- 
gether with some people who are for them and some who are 
against them, where they have to defend themselves, for instance in 
the PTA, they can show quite some stubbornness together with 
those other qualities. (This example indicates both some differenti- 
ation and minimal integration.) 

Score 4_i. The relationship between all adjectives is established,
yet it remains based on surface events or relationships rather than
on an (empathic) understanding of the underlying motivation
producing the characteristics described in the set of adjectives.
Differentiation may or may not be present; integration is present,
but is not of a high level.

Example 1: He is a real organizer. What he does depends on 
what he wants to achieve. When he wants people to do something, 
he talks them into it. He persuades them with quick thinking, and 
he pushes hard. Anyone who disagrees is immediately put into his 
place. If someone wants to outdo him, he quickly knocks him off, 
and then he goes on until he succeeds. (This example shows mod- 
erate differentiation.) 

Example 2: She is a real person, like all of us. She is in college,
she works hard, but she does not let others disturb her. She has the
ability, and she knows it. She has high plans for the future, and she
knows she will get there only if she makes the best grades. So she
does. She watches out for the faults of others, and she capitalizes on
them. She knows how to make use of all these abilities, and she
will go far. (This response shows high level differentiation.)

Score 5_i. The relationship between all adjectives is established,
and based on an (intellectual or emotional empathic) understanding
of underlying personality components and motivational factors. The
person described emerges as a real “live” human being without
inconsistencies. Both differentiation and integration are very high
in this response.

Example: This man is an executive in a successful medium size 
company. He has been driven by the need to succeed, and he has 
used his intelligence to his advantage. His hard work and quick 
reactions have let others overlook, or even like his critical manner. 
After all he usually is right when he criticizes. He has made it up 
the ladder by working day and night, and by outdoing everyone 
around. Now he is at the top, but he will not be satisfied till his 
company is the largest in the field. 

(b) Scoring for differentiation complexity. Whereas five scores 
were needed to describe a subject’s integrative complexity, distinc- 
tions at this point are only sufficiently fine to distinguish between 
four levels of differentiation complexity. 

Score 1_d. The structurally simple person is neither able to
differentiate nor to integrate. Consequently, the description and
example of score 1_i apply here as well. Example 1 of score 3_i
applies also.

Score 2_d. The level of differentiation here is rather low. Usually
the perception of the adjectives is strictly relegated to different
situations or to different points in time, usually not more than two
in number. The example for score 2_i and Example 2 of score 3_i
apply.

Score 3_d. The response of the subject here shows moderate
differentiation. The implication of the adjectives is seen from a
number of points in time or space, or from two or three points of
view. Example 1 of score category 4_i is relevant.

Score 4_d. The response shows high differentiation capacity. This
can be shown in at least two different ways. (1) A highly integrated
response in nearly all cases requires high differentiation (the
exception occurs when the differentiation work has been performed
for a subject by another person). One can therefore infer that high
differentiation has occurred when a score of 5_i is given. (High
level integration often obscures the cues of differentiation.) The
example for score 5_i is relevant. (2) Different views or
implications are visible, often combined with more than one time or
space description. Integration may or may not be present. An example
of such a 4_d response with integration present can be found in
Example 2 for score 4_i.

An example of a low integration (2_i), high differentiation (4_d)
response is given below:

Example: Such a person is very adaptable. He shows a different 
face to everyone. For his boss he works hard. For his children he is 
a good teacher. He is envious of the money his neighbors have.
When he makes decisions he is fast and he sticks to them. But he 
always wants to be right. 

Impression formation and other measures. How does the impression
formation test relate to other measures? Its relationship to the
sentence completion test has already been discussed. Some other
relationships to personality tests and intelligence tests, drawn
from a sample of 124 eastern college males, are presented below:
IQ (College Entrance Examination) V, r = .22; Q, r = -.09;
Academic Achievement, r = -.04; F Scale, r = -.18; Dogmatism,
r = -.06; Rigidity, r = -.28; Social Desirability (Edwards),
r = .21; Verbal Fluency, r = .01.

A comparison of sentence completion and impression formation. 
The value of the impression formation test may be ascertained in 
a number of ways. First of all, is it reliable? Both test-retest
reliabilities and alternate form reliabilities presented in Table 1
would strongly support the value of the instrument. Second, is the
test more useful than the sentence completion test? To answer that 
question one should first consider the intercorrelations of the al- 
ternate forms of the impression formation test as compared to their 
correlations with the sentence completion test. Again, one finds 
that the result supports the impression formation concept as a uni- 
tary measure (cf. Table 1). But what about validity? The sentence 
completion test has been validated in many research settings (see 
above). Can the impression formation test improve on the predic- 
tive value of sentence completion? The writers have stated above 
that the impression formation test is specifically designed as a 
measure of perceptual social complexity. One should, therefore,
compare data derived from subjects selected on the sentence
completion and impression formation tests who are performing in a
perceptual social task.

Streufert and Driver (1965) report an experiment in which subjects
took part in a simulated decision-making environment. Eighty
subjects were selected on the basis of high or low scores on the
sentence completion test. The impression formation measure was 
used only as a secondary check. Twenty teams of four persons 
each, homogeneous in complexity levels (as determined by both 
tests), were formed. The teams were placed in the Tactical Game 
(Streufert, Clardy, Driver, Karlins, Schroder and Suedfeld, 1965). 
Each team was given instructions to make decisions regarding the 
invasion of a mythical island. Subjects were under the impression
that they were playing against another team. Subjects were exposed
to seven one-half hour playing periods. Each period differed from
all others
in environmental input; i.e., Ss were presented with either 2, 5, 8, 
10, 12, 15, or 25 independent single informative statements. The 
order of these “information load” periods was varied at random. 
Streufert and Driver were concerned with a test of the Schroder, 
Driver and Streufert (1967) theory of human information process- 
ing in the realm of interpersonal perception. Schroder et al. have 
postulated a family of inverted U shaped curves relating environ- 
mental input to differentiation and integration in behavior (see 
Figure 1). These authors propose that differentiation and integration
in cognitions increase with increasing environmental input until
an optimal information processing level is reached. If environmental
input increases further, the level of differentiation and integration 
should begin to decrease. Individual differences in complexity should 
result in differing levels of curves relating environmental input 
(plotted on the abscissa) to differentiation and integration in per- 
ception (plotted on the ordinate), so that an entire family of 
curves could be produced, each representing a particular individual 
complexity level. These theorists also propose that points of optimum 
information processing should differ for persons of low complexity 
as compared to persons of high complexity. Persons scoring higher 
on measures of complexity (sentence completion or impression 
formation) should perform optimally at higher points on the
environmental input dimension.
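
The proposed family of inverted U curves can be illustrated with a
small Python sketch. The theory as summarized here does not specify
a functional form, so the bell-shaped function, the width, and all
parameter values below are assumptions chosen purely for
illustration; only the seven information loads come from the text.

```python
import math

# Information loads used in the experiment described above.
LOADS = (2, 5, 8, 10, 12, 15, 25)


def inverted_u(load, optimum, height, width=6.0):
    """Hypothetical inverted U: the level peaks at `optimum`.

    The Gaussian shape and `width` are illustrative assumptions,
    not part of the theory as stated.
    """
    return height * math.exp(-((load - optimum) ** 2) / (2 * width ** 2))


def peak_load(optimum, height, loads=LOADS):
    """Return the tested load at which the curve is highest."""
    return max(loads, key=lambda load: inverted_u(load, optimum, height))


# Two made-up members of the family: a lower curve peaking earlier
# (a "simple" person) and a higher curve peaking later ("complex").
simple_peak = peak_load(optimum=8, height=2.0)
complex_peak = peak_load(optimum=12, height=4.0)

if __name__ == "__main__":
    print("simple peak at load", simple_peak)
    print("complex peak at load", complex_peak)
```

Under these assumed parameters the two curves differ both in level
and in the location of the optimum, which are the two propositions
the experiment was designed to test.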

Figure 1. Degree of flexible differentiation and integration in
perception and performance as a function of information load.

The research of Streufert and Driver (1965) supported all
propositions but the last.⁴ Differentiation and integration in
perception (perceptual complexity) were highest at the same point of
information load, no matter whether Ss were simple or complex.
Mean curves for both groups of Ss reached a peak at 10 items of
independent information per one-half hour period. However, a
closer inspection of the results indicates that for simple Ss the
level of perceptual complexity at information load 8 is most similar
to that at information load 10, while for complex Ss the load most
similar to load 10 is load 12. One may now ask: Is the absence of
differences between complex and simple subjects in optimal
perceptual complexity due to theoretical error or to measurement
problems?

⁴ Differentiation and integration in perception (perceptual
complexity) was measured as follows: During intermissions (after
each period of different environmental input, i.e., information
load, conditions) subjects had to fill out forms asking the
following questions: (1) describe the strategy which you believe the
enemy team is using, (2) how does this strategy relate to enemy
activity, (3) is there any effect of your actions on enemy strategy,
(4) what has made the enemy’s strategy appear reasonable to him, and
(5) comment on the possible consequences of the enemy’s strategy for
your side and for his side.

A number of scoring categories for differences in perceptual
complexity were developed from the theory of Schroder et al. (1967).
Ss received a score of 1 for any response rated by four raters
(inter-rater reliabilities varied from +.89 to +.96) as (1)
cause-effect descriptions involving interrelated actions of Ss’ own
and the opposing team, (2) inferences involving three or more steps,
(3) descriptions of long term intentions of the opposing team, (4)
responses generally identified as “empathic.” Scores for each
subject were added for each period different in information load.
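
The scoring rule for perceptual complexity described in the footnote
(one point per response falling in any of the four categories,
summed within each information load period) can be sketched in
Python as follows. The category labels, function names, and the
sample ratings are hypothetical and serve only to show the
tallying; they are not data from the study.

```python
# Short labels for the four scoring categories (labels are
# illustrative; the categories themselves are those in the footnote).
CATEGORIES = frozenset(
    {
        "cause_effect",         # interrelated actions of both teams
        "multi_step_inference", # inferences of three or more steps
        "long_term_intention",  # long term intentions of opponent
        "empathic",             # responses identified as empathic
    }
)


def period_score(rated_responses):
    """Count responses rated as belonging to a scoring category.

    `rated_responses` holds one label per response; None means the
    raters placed the response in no category.
    """
    return sum(1 for label in rated_responses if label in CATEGORIES)


def scores_by_load(ratings_by_load):
    """Map each information load to one subject's summed score."""
    return {
        load: period_score(labels)
        for load, labels in ratings_by_load.items()
    }


# Hypothetical ratings for one subject in two periods.
example = {
    8: ["cause_effect", None, "empathic", None, None],
    10: ["cause_effect", "multi_step_inference", "empathic", None, None],
}

if __name__ == "__main__":
    print(scores_by_load(example))
```

Averaging such per-period scores over the subjects of a group gives
the kind of mean curve that is plotted against information load.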

If the impression formation measure is to be of value, it should
predict perceptual complexity considerably more accurately than the
sentence completion test. To test this assumption, the Streufert and
Driver (1965) experiment was replicated with ten four-man groups
of Ss scoring low on the sentence completion test, ten groups of
subjects scoring high on the sentence completion test, ten groups of
subjects scoring low on the impression formation test, and ten
groups of subjects scoring high on the impression formation test.
A comparison was made of low and high scorers on each test. The
results are presented in Figure 2. A Lindquist Type I analysis of
variance resulted in significant main effects for information load
(p < .01) and for complexity (p < .01) but no significant
interaction effect when persons scoring high on the sentence
completion test were compared with persons scoring low on that test.
If, however, persons scoring low on impression formation were
compared with persons scoring high on impression formation,
significant main effects were found for information load (p < .01)
and complexity (p < .01), together with an interaction effect
(p < .05). Inspection of Figure 2 indicates that differences between
simple and complex persons in optimal perceptual complexity points
did occur. Simple Ss reached a peak at information level 8 while
complex persons attained a peak at information level 12.

Figure 2. Mean number of categories of perceptual differentiation
and integration as a function of information load for subjects
selected as “simple” or “complex” on the Sentence Completion Test
(SCT) and the Impression Formation Test (IFT). Ordinate: mean number
of categories of perceptual differentiation-integration. Curves:
simple Ss selected with the SCT; complex Ss selected with the SCT;
simple Ss selected with the IFT; complex Ss selected with the IFT.

These results tend to support the value of impression formation
as a predictor of interpersonal perceptual complexity.


COMPARABILITY AND VALIDITY OF THREE FORMS OF SCAT


THOMAS M. GOOLSBY, JR. 
The Florida State University 


The School and College Ability Tests (SCAT) have been widely
used at all grade levels. Statements concerning validity of the vari- 
ous forms of SCAT seem to be limited and vary in the literature 
published by the authors of the tests. The most frequently noted 
statement is that “the forms were designed to aid in estimating the 
capacity of a student to undertake the next higher level of school- 
ing.” 

The present study was designed (1) to determine the compara- 
bility of two forms of SCAT at Level 1 (Form 1C and Form 1D) as 
well as of a third form specifically constructed for inclusion in an 
achievement test battery and (2) to determine certain validity co- 
efficients when some major criterion variables are considered. 

Procedures. Form 1C or Form 1D was administered to entering 
freshmen at a large Southeastern University in September, 1966. 
Alternate Ss in each testing room responded to the items in Form 
1C or Form 1D. Other data collected were Florida Twelfth Grade 
Achievement Tests (FTGAT) scores in the Fall of 1966 (which 
included as a subtest a specially-designed form of the SCAT pre- 
pared by the Educational Testing Service) and Freshman Grade 
Point Average (FGPA) in January, 1967. 

In addition to the determination of means and standard deviations
for comparative purposes, intercorrelations were calculated to
determine certain measurement characteristics of SCAT and FTGAT and
to ascertain the utility of SCAT as a predictor of certain criteria
and as a correlate of certain variables, including Freshman Grade
Point Average (FGPA).

Kuder-Richardson Formula 20 reliability coefficients of .90 or
higher are reported for SCAT and FTGAT.

Results and discussion. Table 1 presents the means and standard
deviations for Form 1C, Form 1D, and FTGAT. Visual inspection
of the data presented in Table 1 might lead one to conclude that
Form 1C and Form 1D are comparable. The means for the groups
responding to the items in Form 1C and Form 1D do not differ
significantly at the .01 level of confidence. The FTGAT means for
those groups likewise do not differ significantly at the .05 level
of confidence.
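
A reader can check the claim of comparable means from the summary
statistics alone. The Python sketch below applies an unequal
variance (Welch) t statistic to the SCAT total scores reported in
Table 1 (81.03 and 81.67, with standard deviations 11.45 and 10.89
and group sizes 301 and 299). The article does not say which
significance test was actually used, so the choice of Welch's
statistic and the function name welch_t are assumptions.

```python
import math


def welch_t(mean1, sd1, n1, mean2, sd2, n2):
    """t statistic for two independent groups from summary data.

    Uses the unequal-variance (Welch) standard error. This is an
    assumed choice; the source does not name its test.
    """
    standard_error = math.sqrt(sd1 ** 2 / n1 + sd2 ** 2 / n2)
    return (mean1 - mean2) / standard_error


# SCAT T (total) for the Form 1C and Form 1D groups, from Table 1.
t_total = welch_t(81.03, 11.45, 301, 81.67, 10.89, 299)

if __name__ == "__main__":
    print(f"t for SCAT total, Form 1C vs. Form 1D: {t_total:.2f}")
```

With roughly 600 subjects, an absolute t value below 1 is far from
any conventional critical value, which agrees with the statement
that the form means do not differ significantly.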

TABLE 1

Means and Standard Deviations for SCAT-Form 1C (N = 301),
SCAT-Form 1D (N = 299), and FTGAT

                             Form 1C(a)          Form 1D
                            Mean      SD       Mean      SD
SCAT V                     41.79    8.44      41.25    7.46
SCAT Q                     39.24    6.17      40.42    6.49
SCAT T                     81.03   11.45      81.67   10.89
FGPA                        2.45     .74       2.36     .81
FTGAT(b):
  SCAT-FTGAT(c)            91.40   12.76      91.76   17.24
  English (EN)             61.64   10.72      61.30   13.08
  Social Studies (SS)      47.59    8.81      47.61    9.71
  Natural Science (NS)     38.13    8.90      38.12    9.61
  Mathematics (MS)         38.31    9.89      38.38   10.34
  Total (12th Grade)      231.37   37.00     230.79   44.44

SCAT-FTGAT = The School and College Ability Test included as a
             subtest in the Florida Twelfth Grade Achievement Test
SCAT V     = School and College Ability Test, Verbal Subtest
SCAT Q     = School and College Ability Test, Quantitative Subtest
SCAT T     = School and College Ability Test, Total Test
FGPA       = Freshman Grade Point Average
FTGAT      = Florida Twelfth Grade Achievement Test

(a) 1C or 1D will denote a specific group of subjects responding to
items in the respective forms of SCAT at a single testing session.

(b) All subjects responding to Form 1C or Form 1D had also responded
to the items in FTGAT approximately one year previous.

(c) The form of SCAT included as a subtest of FTGAT is reported by
the Educational Testing Service, Princeton, New Jersey, to be a
specially designed form.

Table 2 presents the correlations between certain forms and/or
subtests of SCAT. The correlations of SCAT-FTGAT with SCAT-1C
and SCAT-1D are .64 and .40, respectively. Even though the
means and standard deviations presented in Table 1 might lead one
to conclude comparability or equivalency for Forms 1C and 1D, the
difference in the magnitude of the relationships of .64 and .40
presented in Table 2 strongly suggests that Form 1C and Form 1D are
substantially different in some unknown respect.

TABLE 2
Correlations between Certain Forms and/or Subtests of SCAT

                SCAT-FTGAT       SCAT-Q
SCAT-1C            .64
SCAT-1D            .40
SCAT-V                           .21-1C
                                 .21-1D
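
Whether correlations of .64 and .40 differ by more than sampling
error can be examined with Fisher's r-to-z transformation for two
independent samples. The Python sketch below is offered as a check
on the argument, not as the author's procedure, which the article
does not describe. It assumes that the correlations rest on the full
groups of 301 and 299 subjects; the function name is introduced
here for illustration.

```python
import math


def fisher_z_difference(r1, n1, r2, n2):
    """z statistic for the difference of two independent correlations.

    Each r is transformed with atanh (Fisher's z); the standard error
    of the difference is sqrt(1/(n1 - 3) + 1/(n2 - 3)).
    """
    z1, z2 = math.atanh(r1), math.atanh(r2)
    standard_error = math.sqrt(1.0 / (n1 - 3) + 1.0 / (n2 - 3))
    return (z1 - z2) / standard_error


# SCAT-FTGAT with SCAT-1C (.64) and with SCAT-1D (.40); the group
# sizes are assumed to equal the Ns given for the two forms.
z_forms = fisher_z_difference(0.64, 301, 0.40, 299)

if __name__ == "__main__":
    print(f"z for .64 vs. .40: {z_forms:.2f}")
```

Under these assumptions the statistic is about 4, well beyond the
usual two-tailed critical value of 1.96, which is consistent with
the conclusion that the two forms differ in some respect.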

Included in Table 2 are the relationships between SCAT-Q and
SCAT-V for the groups responding to Form 1C and Form 1D; both are
of the magnitude of .21. This magnitude of relationship indicates
substantial independence of the two subtests of SCAT in both
forms.

The interrelationships among the subtests of FTGAT presented 
in Table 3 range from .44 to .75 and are considered to be very 
acceptable relationships for an achievement battery of this type. 


TABLE 3
Interrelationships among the Subtests of FTGAT

            SS          NS          MS
EN        .58-1C      .48-1C      .44-1C
          .70-1D      .57-1D      .56-1D
SS                    .70-1C      .54-1C
                      .75-1D      .65-1D
NS                                .68-1C
                                  .74-1D

Most of the correlations of SCAT-FTGAT with achievement
subtests of FTGAT presented in Table 4 might be considered
satisfactory with few exceptions, but when considering total FTGAT
the relationship is of such magnitude as to cast serious doubt about
the desirability of including the ability test in the FTGAT.

TABLE 4
Correlation of SCAT-FTGAT with Achievement Subtests of FTGAT

              SCAT-FTGAT
EN        .67-1C      .77-1D
SS        .7?-1C      .76-1D
NS        .71-1C      .70-1D
MS        .65-1C      .72-1D

In view of the very high relationship (.89 and .91) of total
FTGAT with SCAT-FTGAT, the relationships presented in Table
5 are of interest. Relationships between very extensive achievement
measures and ability measures have been found to be as high
as .85, but rarely higher. At the .85 level, and especially in the
case of the .89 and .91 levels, the apparent extensiveness of the
FTGAT might allow one to consider it a very good substitute for
an ability test. Then it might be said that the data presented in
Table 5 suggest that ability measured at substantially different
times (one year or more apart) is defined by different arrays of
variables according to the SCAT tests designed to measure the
entity, ability. This observed effect could be due to difficulties
in designing measures of "latent" and distinct ability factors
attributable to development and/or instruction. This inference is
also supported by the correlations of .64 and .40 in Table 2.

TABLE 5
Correlations of SCAT-1C and SCAT-1D with Achievement Subtests of FTGAT

               SCAT-1C                 SCAT-1D
            V      Q      T         V      Q      T
EN         .5?    .38    .8?       .41    .10    .84
SS         .54    .32    .57       .44    .13    .38
NS         .44    .51    .60       .40    .33    .47
MS         .29    .69    .59       .26    .53    .50
Total      .57    .53    .10       .44    .30    .48

Table 6 indicates something of the predictive power of SCAT
when FGPA is the criterion. The highest relationship observed
between FGPA and any of the forms of SCAT was .40. Even though
all of the relationships presented in Table 6 may be significantly
different from zero, they have relatively little practical value for
predictive purposes.
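
The remark about practical value can be made concrete with two
standard indices computed from a correlation: the proportion of
criterion variance accounted for (r squared) and the index of
forecasting efficiency, 1 - sqrt(1 - r^2), which gives the
proportional reduction in the standard error of estimate. The Python
sketch below applies both to the highest reported value, .40. The
article itself does not report these indices; they are added here
as an illustration, and the function names are assumptions.

```python
import math


def variance_explained(r):
    """Proportion of criterion variance accounted for by r."""
    return r ** 2


def forecasting_efficiency(r):
    """Proportional reduction in the standard error of estimate."""
    return 1.0 - math.sqrt(1.0 - r ** 2)


best_r = 0.40  # highest SCAT-FGPA correlation reported in the text
shared = variance_explained(best_r)    # about 0.16
gain = forecasting_efficiency(best_r)  # about 0.08

if __name__ == "__main__":
    print(f"variance explained:     {shared:.2f}")
    print(f"forecasting efficiency: {gain:.3f}")
```

Even the best predictor thus leaves about 84 percent of the variance
in freshman grades unaccounted for and shrinks prediction error by
less than a tenth.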


TABLE 6
FGPA Correlations with Various Forms of SCAT

                            Predictor Variable
Criterion Variable     SCAT-1C     SCAT-1D     SCAT-FTGAT

Table 7 indicates something of the predictive power of FTGAT
when FGPA is the criterion. Again, these relationships have little
practical value for predictive purposes.

TABLE 7
FTGAT Correlations with FGPA

                    FTGAT (Predictor Variable)
                EN      SS      NS      MS     TOTAL
FGPA   1C      .31     .33     .26     .27     .34
       1D      .29     .24     .31     .31     .33

FGPA = Freshman Grade Point Average.

When certain subtests of FTGAT were combined in the multiple
prediction procedure, as indicated in Table 8, predictive
efficiency improved very little.


TABLE 8
FTGAT Multiple Correlations with FGPA

Predictor Variables        FGPA (Criterion Variable)
                              1C          1D
EN + SS                      .36         .29
EN + SS + MS                 .37         .34
SS + NS                      .33         .31
SS + NS + MS                 .35         .34

EDUCATIONAL AND PSYCHOLOGICAL MEASUREMENT 
1967, 27, 1047-1054. 


PREDICTIVE VALIDITY OF THE METROPOLITAN 
READINESS TESTS AND THE MURPHY-DURRELL 
READING READINESS ANALYSIS FOR WHITE AND FOR 
NEGRO PUPILS 


BLYTHE C. MITCHELL 
Test Department, Harcourt, Brace & World, Inc. 


Some doubts have been expressed about the usefulness of the
usual readiness tests with pupils of certain racial and ethnic groups,
and with pupils from families who are at the lower socio-economic
levels. The use of two readiness tests in the USOE Cooperative
First-Grade Reading Study of 1964-65 has provided data for an
investigation of the predictive validity of these tests with groups
differentiated according to the above characteristics. The data
reported here provide a comparison of the predictive value of the
tests for the white and the Negro pupils in the total population
tested. Data differentiated by socio-economic levels will be
reported in a subsequent article.

Data. In October of 1964, a large number of pupils in 27 projects
throughout the country were administered a number of tests related
to the several proposed studies of various aspects of reading.
Among these tests were the Murphy-Durrell Reading Readiness
Analysis (1964 Revision) and the Metropolitan Readiness Tests
(1964-65 Revision), Form A, both of which were given to the
12231 of these first-grade pupils judged to provide a representative
national sample. In May of this 1964-65 school year 9497 of
these pupils took the reading and spelling subtests in the Stanford
Achievement Test, 1963 Revision, Primary I, Form X.

Pupil population. Information as to racial-ethnic group was pro- 
vided in some, but not all, of the projects and showed the following 
pupil count: 
White               7310
Negro                518
Mexican              139
Oriental              37
Puerto Rican          15
Eskimo or Indian       5
Total known         8024
No information      1473
Total               9497


The 7310 white pupils were from 13 different area projects; the 
518 Negro pupils came from 11 of these 13, with the greater pro- 
portion from a single urban school system in North Carolina. Median 
CA of the white pupils at the time of testing in early October was 
6-3.8; of the Negro pupils 6-5.4. The white pupils entered first 
grade with an average (median) of 72 days of pre-first grade school- 
ing, Negro pupils with 46. Length of the school year was 181 days 
for the white pupils, 179 for the Negro; the white pupils' school day 
averaged 5.2 hours, the Negro pupils’ 6.2 hours. Size of classroom 
averaged 27 for the white pupils, 32 for the Negro. 

Tests. The Metropolitan Readiness Tests have a total of 102 
items in six subtests; the Murphy-Durrell Reading Readiness Analy- 
sis is made up of three subtests with a total of 118 items. The com- 
position of the two instruments is as follows: 


                                        Murphy-Durrell
Metropolitan Readiness Tests            Reading Readiness Analysis
                    Number                                  Number
Subtest             of Items            Subtest             of Items
Word Meaning           16               Phonemes               48
Listening              16               Letter Names           52
Matching               14               Learning Rate          18
Alphabet               16
Numbers                26               Total                 118
Copying                14
Total                 102


The criterion (achievement) test, the Stanford battery (1963 
Revision), has the following distribution of items: 


Word Reading 35          Vocabulary 39            Spelling 20
Paragraph Meaning 38     Word Study Skills 56     (Arithmetic 63)

The Arithmetic test was not used as a part of the Reading Study.

The reliabilities of the two readiness tests have been determined 
for a number of different groups. Split-half values (corrected by 
the Spearman-Brown formula) range from .52 to .97 for the nine 
separate subtests, from .90 to .98 for total score. Test-retest (Form 
A-Form B) reliabilities are available for the Metropolitan tests 
only, with the median of determinations for pupil groups in four 
different school systems ranging from .40 (Listening) to .84 (Al- 
phabet) for single subtests. The four test-retest coefficients for total 
score are .89, .89, .93 and .93. 
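
The Spearman-Brown correction mentioned above can be sketched in a few lines. This is an illustrative sketch only: the function names are hypothetical, and the half-test correlations are not reported in the article, so the second function simply inverts the two-half formula to show what uncorrected value each reported coefficient would imply.

```python
def spearman_brown(r_half: float, factor: float = 2.0) -> float:
    """Projected reliability of a test lengthened by `factor`.

    With factor = 2 this is the usual split-half correction:
    r_full = 2 * r_half / (1 + r_half).
    """
    return factor * r_half / (1.0 + (factor - 1.0) * r_half)


def half_from_corrected(r_full: float) -> float:
    """Invert the two-half correction: half-test r implied by r_full."""
    return r_full / (2.0 - r_full)


if __name__ == "__main__":
    # Lowest and highest corrected subtest values quoted in the text.
    for r_full in (0.52, 0.97):
        r_half = half_from_corrected(r_full)
        print(f"corrected {r_full:.2f} implies half-test r of {r_half:.3f}")
```

The inversion makes the size of the correction visible: a corrected coefficient near the bottom of the reported range rests on a much smaller correlation between halves than one near the top.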

Results. Correlations of the Readiness subtest and total scores 
with raw score of each Stanford subtest, computed separately for 
white and Negro pupils, are shown in Table 1. 

The results shown in Table 1 would not support the hypothesis 
that the Metropolitan or the Murphy-Durrell tests have lower 
predictive validity for Negro pupils than for white pupils. There 
are differences (some significant) in both directions in the observed 
correlations with the achievement criterion. For the 45 comparisons 
between the values found for the two groups, however, that for 
Negro pupils is higher in 26 instances. 

In this comparison of the predictive validities it is important to 
consider the relative size of the standard deviations for the white 
and the Negro groups. Those for the readiness tests, shown in the 
final column of Table 1, indicate that these differences operate in 
both directions, with the scores of the Negro group being less vari- 
able than those of the white group for five of the nine readiness 
subtests. For total score on both readiness tests there is a very slight 
difference in the standard deviations of the Negro and the white 
pupil groups, but it is in a different direction for the two tests (15.8 
and 16.6 on Metropolitan, 26.7 and 25.5 on Murphy-Durrell). 
On the criterion, the Stanford test, the scores of the Negro group are 
less variable than those of the white group on all four reading sub- 
tests, the difference in Paragraph Meaning being marked (Negro 
8.4, white 9.7); for Spelling there is a slight difference in SD's, 
with that of the Negro group being slightly higher (6.4 and 6.1). It 
would seem, then, that differences in variability, insofar as they 
may have been a factor in the size of the obtained correlations, 
would have acted more often to depress the validities for the Negro 
group than for the white group. Had the Negro standard deviations 
been comparable with those for the white group, the slight trend
for the Negro-group correlations with Stanford to exceed those for 
the white group might be even more noticeable. 

Failure to find lower relations between readiness tests and a cri- 
terion test for Negro groups in the present study confirms that 
found by the writer for the earlier (1949) edition of the Metro- 
politan Readiness Tests (Mitchell, 1962). With all pupils in the 
white and the Negro schools of a Virginia county included in this 
study, the correlations between total score in September on the 
Metropolitan Readiness Tests, 1949 Edition, Form R, and the four 
subtests of the Metropolitan Achievement Tests, 1959 Revision, 
Form A, were as follows: 


                              Metropolitan Achievement Tests
                        Word         Word
                 N    Knowledge  Discrimination  Reading  Arithmetic
White pupils    919     .56          .56           .51       .63
Negro pupils    251     .55          .47           .48       .62


Only one of the differences, that between .56 and .47 for Word
Discrimination, is significant at the .05 level. 
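
One common way to compare two correlations from independent samples is Fisher's r-to-z transformation. The article does not say which test was used, so the sketch below is an assumption, and the function names are hypothetical. It takes the Word Discrimination values from the table above (.56 with N = 919, .47 with N = 251).

```python
import math


def fisher_z_difference(r1: float, n1: int, r2: float, n2: int) -> float:
    """z statistic for the difference between two independent correlations."""
    z1, z2 = math.atanh(r1), math.atanh(r2)        # Fisher r-to-z transform
    se = math.sqrt(1.0 / (n1 - 3) + 1.0 / (n2 - 3))  # SE of z1 - z2
    return (z1 - z2) / se


def two_tailed_p(z: float) -> float:
    """Two-tailed normal probability for a z statistic."""
    return math.erfc(abs(z) / math.sqrt(2.0))


if __name__ == "__main__":
    z = fisher_z_difference(0.56, 919, 0.47, 251)
    print(f"z = {z:.2f}, two-tailed p = {two_tailed_p(z):.3f}")
```

Under this assumed procedure z comes out a little above 1.7, which exceeds the one-tailed .05 critical value of 1.645 but not the two-tailed value of 1.96; the article does not state which criterion was applied.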

Additional comparisons. The 1964-65 First-Grade Reading 
Study included a few members of other ethnic groups. Table 2 gives 
the correlations between total Readiness score in October and 
reading achievement the following May for the above white and 
Negro groups, and for two much smaller groups, one of Mexican, 
one of Oriental children. Also shown are the correlations for the 
1473 children whose racial or ethnic group was not indicated on 
the score reports. It is noted that the generally higher validities 
found for this group are probably due in part to the greater varia- 
bility of its score distributions. The final row in each section of the 
table gives the correlations for the combined group of 9497 pupils. 

Conclusion. From the data of this study it would appear that the 
two readiness tests considered, the Metropolitan and the Murphy- 
Durrell, perform their function as well with Negro pupils as with 
white pupils, and that the general level of predictive validity is sim- 
ilar for four racial-ethnic groups studied. 
FURTHER VALIDATION OF THE KAHN
INTELLIGENCE TESTS¹


WILLIAM T. CARSE 
University of Texas 


The prediction of school success continues to pose problems for
educators at all school levels. These problems are enlarged when
certain sub-groups within the school population are considered. 
The measurement of intelligence, which has served as a basic pre- 
dictor of academic success, has been greatly hampered by the vary- 
ing cultural backgrounds of children (Crow, Murray, Smyth, 
1966). Many efforts to adequately measure intelligence of all 
groups have been made by psychologists. Among the scales de- 
veloped is the Kahn Intelligence Tests: Experimental Form (KIT: 
EXP) which has the announced purpose of creating a less culture- 
bound device than has been available (Kahn, 1960b). 

The Kahn Intelligence Tests: Experimental use sixteen (16) 
plastic objects (butterflies, dogs, hearts, stars, a cross, a parrot, an 
anchor, a circle, and a segment of a circle) and a strip of felt with 
numbered squares. The nature of the tasks allows measurement in 
most recognized areas of intellectual functioning with a minimum
of verbal responses. The KIT:EXP is reported to yield more valid
measures of intelligence of children from culturally-different homes 
than have any previous instruments (Kahn, 1960b). This study at- 
tempts to determine the validity of this statement when these cul- 
turally different children reside in an isolated rural area by com- 
paring the mental ages and intelligence quotients obtained from the 
KIT:EXP and the Revised Stanford-Binet Intelligence Scales, Form 
L-M. A secondary purpose was to examine basal and ceiling ages 


1 The research reported herein was partially supported by a faculty research 
grant from the University of Kentucky Research Foundation. 
together with the item responses to determine any observable
differences in types of responses. 

The KIT: EXP has a reported reliability of .94 and a correlation 
of .75 with the Stanford-Binet Intelligence Scales, Revised 1937 
Edition (Kahn, 1960a). McDaniel and Carse (1966) reported a 
correlation of .83 between the two tests (Stanford-Binet, Form 
L-M) when the intelligence quotients were obtained from children 
living in urban areas and from homes where the fathers were in 
managerial and professional occupations. 

Procedures. The two intelligence scales (KIT:EXP and Stan- 
ford-Binet, Form L-M) were administered to fifty-two (52) chil- 
dren (33 boys and 19 girls). The children were tested in an un- 
planned sequence; thus, some of them completed the KIT:EXP 
first and others the Stanford-Binet. Two examiners did the testing. 
Each administered 26 Stanford-Binet and 26 KIT:EXP tests. 
The obtained mental ages and intelligence quotients were used for 
the computation of correlations and of differences between means 
as well as for a comparison of basal and ceiling ages. 

Sample. These fifty-two children, ages 63 to 78 months (average 
73.8), who were all residents of Martin County, Kentucky, were 
enrolled in the summer 1965 Headstart Program. This county, in 
extreme Eastern Kentucky, had a population of 10,201 in 1960. Of
these, 9,416 were classified as rural nonfarm and 785 as rural-farm, with no
urban population. The population has steadily decreased since 
1950 with the greatest loss being in the higher (for Martin County) 
income group. 

The county has no major industries. Transportation facilities are 
inadequate by general standards. The income is low—the per capita 
income in 1960 was reported as $361.00 as contrasted with the 
Kentucky average of $1,573.00. The median family income was 
reported as $2,071.00 for the same year with the Kentucky median 
being $4,051.00. Sixty-three per cent of the families had incomes 
of less than $3,000.00 (Brown and Ramsey, 1962). This low in- 
come allowed most children to attend the Headstart classes. 

Results. The means, standard deviations, correlations, and t-tests 
are shown in Table 1. The sample was studied as (1) the total 
group, (2) the boys (N = 33), and (3) the girls (N = 19). The
correlations between obtained mental ages were .69 for the total 
group, .71 for the boys, and .60 for the girls. The correlations be- 
TABLE 1
Means, Standard Deviations, t-tests Between Mean Differences and Correlations
Among Mental Ages and Intelligence Quotients with Chronological Age
Averages for Total Group, Boys and Girls

                                                 Standard           t-test mean
                                 Means           Deviations          difference
                           N    SB*   KIT:EXP   SB*   KIT:EXP    r     SB-KIT
Total Group
  Chronological Age       52   73.635
  Mental Age              52   71.05   70.37    9.11    9.49    .69      .66
  Intelligence Quotient   52   96.83   96.10   14.33   13.05    .76      .54
Boys
  Chronological Age       33   73.61
  Mental Age              33   72.78   71.58    9.68    8.98    .71      .95
  Intelligence Quotient   33   98.91   96.97   14.54   11.52    .87     1.52
Girls
  Chronological Age       19   73.68
  Mental Age              19   68.58   68.32    6.76    9.86    .60      .01
  Intelligence Quotient   19   93.21   94.58   13.13   13.27    .70     -.59

* The abbreviation SB is used to indicate the Stanford-Binet Intelligence Scale, Form L-M.
Chronological age is entered only once for each group.


tween the obtained intelligence quotients were .70 for the girls, 
.87 for the boys, and .76 for the total group. All correlations are
significant. 

No significant differences between the means were found for 
any of the three groups. The greatest difference obtained was be- 
tween the boys’ mean intelligence quotients. The difference of 1.94 
was in the direction of the Stanford-Binet. 
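
A t-test between two correlated means can be recovered from summary statistics alone when the means, standard deviations, correlation, and N are known. The sketch below assumes the reported values are dependent-sample t-tests, which the article does not state, and the function name is hypothetical. Inputs are the intelligence quotient rows of Table 1.

```python
import math


def paired_t_from_summary(mean1: float, sd1: float,
                          mean2: float, sd2: float,
                          r: float, n: int) -> float:
    """Dependent-sample t from means, SDs, their correlation, and N.

    The variance of the difference scores is
    sd1^2 + sd2^2 - 2 * r * sd1 * sd2.
    """
    var_diff = sd1 ** 2 + sd2 ** 2 - 2.0 * r * sd1 * sd2
    se = math.sqrt(var_diff / n)  # standard error of the mean difference
    return (mean1 - mean2) / se


if __name__ == "__main__":
    # Boys: SB 98.91 (SD 14.54), KIT:EXP 96.97 (SD 11.52), r = .87, N = 33
    print(f"boys:  t = {paired_t_from_summary(98.91, 14.54, 96.97, 11.52, 0.87, 33):.2f}")
    # Total: SB 96.83 (SD 14.33), KIT:EXP 96.10 (SD 13.05), r = .76, N = 52
    print(f"total: t = {paired_t_from_summary(96.83, 14.33, 96.10, 13.05, 0.76, 52):.2f}")
```

Both results fall close to the tabled values of 1.52 and .54. Small discrepancies are expected from rounding in the published figures and from not knowing whether N or N - 1 was used in the original computation.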

The total group was divided at the median of the Stanford-Binet 
intelligence quotients. Means, standard deviations, correlations, 
and t-tests between means were computed for each group. The 
statistics are shown in Table 2. The mean for the high group (above 
the median) is 108.07 on the Stanford-Binet and 102.69 for the 
KIT:EXP. The low group (below the median) had a mean of 
85.58 for the Stanford-Binet and 89.19 for the KIT:EXP. 

The differences between means in both groups are significant 
with the difference in the direction of the Stanford-Binet for the 
high group and in the direction of the KIT:EXP for the low group. 
The correlations between the tests are .76 for the high group and 
.81 for the low group. 

The obtained basal and ceiling ages for the Stanford-Binet and the 
KIT: EXP are shown in Table 3. Although there is no direct com- 
TABLE 2
Means, Standard Deviations, t-tests, and Correlations for Intelligence Quotients
Obtained from the High and Low IQ Score Groups with Chronological
Age and Sex Distributions

                                       Standard          t-test mean
                        Means          Deviations         difference
                 N     SB   KIT:EXP    SB   KIT:EXP   r     SB-KIT
High Group      26  108.07  102.69   10.41   12.47   .76     3.52
Low Group       26   85.58   89.19    7.01    9.37   .81    -3.28

Chronological age
  High Group    72.56 months
  Low Group     74.88 months
Sex Distribution
  High Group    Boys = 18, Girls = 8
  Low Group     Boys = 15, Girls = 11


parability between the two instruments because the KIT:EXP 
does not divide the lower years into 6 month intervals, it does ap- 
pear that the median basal and ceiling ages are lower on the KIT: 
EXP. However, the number of year tests used to test all children 
was ten for each instrument. The spread was greater for the Stan- 
ford-Binet on the ceiling ages because many children obtained one 
or two correct responses on the higher level Stanford-Binet tests, 
although there was a tendency for many children not to pass all 
KIT:EXP items included for the upper years. 

Discussion. The correlations obtained between the Stanford-
Binet and the KIT:EXP intelligence quotients for these children 
residing in an isolated Eastern Kentucky community were approxi- 


TABLE 3
Obtained Basal and Ceiling Ages for Both Instruments

      Stanford-Binet                    KIT:EXP
            Number                          Number
Year    Basal   Ceiling     Year        Basal   Ceiling
IV        2                 I-II          1
IV-6     23                 II-IV         8
V        24                 IV-V         34
VI        3        4        V-VI          9
VII               15        VI-VII                 8
VIII              12        VII-VIII              24
IX                17        VII-IX                 9
X                  3        IX-X                   4
XI                 0        X-XI                   5
XII                1        XI-XII                 2
mately equal to those obtained by Kahn in the standardization of
his tests (Kahn, 1960a). Kahn reported a correlation of .75 which
is smaller than the .85 obtained by McDaniel and Carse (1966) 
and larger than the .70 obtained here. This finding could be ex- 
plained by the nature of the backgrounds of the second two groups. 
The group tested by McDaniel and Carse were from an urban com- 
munity and managerial homes. This would seem to indicate that 
extensive word knowledge, as would be expected in the urban 
group, does help to increase the spread of scores and thus the 
correlation between the two tests. 

When the Stanford-Binet IQ's were divided at the median, a cor- 
relation of .81 was obtained between the two measures for the low 
group and .76 for the high group. The difference between these cor-
relations was not significant. It did indicate, when the direction of
the mean differences is considered, that the increased intelligence
quotients obtained by the lower children on the KIT:EXP and the 
increased IQ obtained on the Stanford-Binet by the higher group 
tended to lower the correlation obtained when the total group was 
considered. In fact the correlations of .76 and .81 for the higher and 
lower groups lowered to .70 for the total group. The restricted range 
of scores would argue for decreased correlation for the two groups— 
this was not obtained. These correlations together with the signifi- 
cant mean difference seem to bear out Kahn's contention that his 
tests may be effective for certain marginal groups when word usage 
is considered. 

Since the KIT: EXP manual reports adequate reliability, and with 
the correlations obtained in this study, the indications are that the 
KIT:EXP does yield an adequate measure of intelligence. Perhaps 
the greatest factor in favor of the KIT:EXP was the greater ease 
of administration that both examiners noted when this group of 
children was considered. Greater interest was maintained when the 
KIT:EXP was given. This could be accounted for either by the na- 
ture of the items—the KIT:EXP allows the child to manipulate 
more objects—or the few verbal responses that are required. Ob- 
servations by the examiners indicate that it was the fewer verbal re- 
quirements of the KIT:EXP which allowed for a faster and more 
enjoyable testing experience. Kahn’s statement (1960a, p. 239) 
seems to hold: “The KIT:EXP may prove to have considerable 
immediate value when the examiner has a “hunch” that the IQ he
obtains from another test is not an adequate estimate of the ex-
aminee's capacity to function . . . to be used as a guide when the 
examiner requires an independent check on the hunch that the 
examinee would have scored significantly better if the testing could 
have been less dependent on cultural, educational, and verbal re- 
quirement.” The ease with which the KIT:EXP may be scored (not 
dependent in most tests upon the interpretation of verbal re- 
sponses) would probably make the KIT:EXP an excellent supple- 
mentary test for school counselors and psychologists to use when 
they suspect that the group intelligence tests are discriminating 
against the culturally different children in their schools. 
A DEVELOPMENTAL SCALE TO ASSIST IN THE
PREVENTION OF LEARNING DISABILITY¹ ²


SELMA G. SAPIR³
AND
BERNICE WILSON
Scarsdale Public Schools
and
Teachers College, Columbia University


This research is part of a pilot study which hypothesized that
children with problems in perceptual-motor, bodily schema, and/
or language development can be identified by kindergarten age and
then trained in areas of deficit. The Sapir Developmental Scale (c)
(Sapir, 1966) was used to identify children with developmental lags 
at the kindergarten age. 

Levels of development have been considered good predictors of 
intellectual and academic growth (Gesell, 1925). Emphasis has 
been placed on perceptual-motor functioning (Silver and Hagin, 
1965), visual perception (Frostig, 1965), bodily schema (Kep- 
hart, 1960), language development (Deutsch, 1964), and sensory 
integration (Birch and Lefford, 1963), and their relationship to 
reading and intelligence. 

The following hypotheses will be tested: (1) developmental dif-
ferences can be identified at the kindergarten level; (2) they tend


1 This investigation is supported by Grant No. 6-8275, Department of
Health, Education and Welfare, U. S. Office of Education, Division of Handi-
capped Children and Youth.

2 The authors wish to express their appreciation to Mary Ransford, teacher,
and Evelyn Stromberg, nurse-teacher, at the Quaker Ridge School, Scarsdale, 
New York, and to Dr. Stanley Lieberfreund, Dr. Paul Eiserer, and Dr. Miriam 
Goldberg of Teachers College for their help and guidance. 

3 Special gratitude goes to Dr. Arnold Gold of the Neurological Institute 
for all the hours of service he volunteered. 
to persist in first grade; (3) developmental performance, as measured
by the Sapir Developmental Scale (c) (Sapir, 1966), can predict 
academic performance at the end of first grade. 

Subjects. Ss were 54 children, 36 girls and 18 boys, aged five 
years one month to six years two months at the time of testing. 
The children were the total kindergarten population of a suburban 
primary public school in a high socio-economic community. 

Instrument. The Developmental Scale was constructed to reflect 
differences in developmental patterns of kindergarten children in 
three major areas: (1) perceptual-motor skills; (2) bodily schema; 
and (3) language development. Perceptual tests consist of visual 
discrimination, visual memory, auditory discrimination, auditory 
memory, and visual-motor skills. Tests of bodily schema include (1)
directionality-laterality, (2) body image, and (3) visual-motor
spatial relationships. Language development includes tests of con-
cepts of orientation to the environment and vocabulary. 


The following items make up the Sapir Developmental Scale (c):

I. Visual Discrimination
a. Child asked to match 4 forms on a card with 6 forms on it.
b. Child asked to determine likenesses and differences in
pairs of words shown.
c. Child asked to find imbedded triangles in complex designs.

II. Visual Memory
a. Forms shown for 5 seconds, then removed. Child told to
identify form from 6 forms.
b. Forms shown for 5 seconds. Child asked to reproduce
from memory.

III. Auditory Discrimination
a. Child asked to determine likenesses and differences in
pairs of words.

IV. Auditory Memory
a. Child asked to repeat series of numbers heard.
b. Child asked to repeat backwards series of numbers heard.
c. Child asked to repeat tapping rhythms.

V. Visual Motor
a. Copy square, triangle, and diamond.
b. Draw two horizontal parallel lines 14 inch apart.
c. Draw circle and cut it out with scissors.

VI. Directionality/Laterality
a. Right-Left directionality: eyes open and eyes closed.
b. Laterality with record of eyedness, handedness, and
footedness.

VII. Orientation to Environment
a. Child asked to answer questions about the relationships of
size, time, movement in the environment.

VIII. Visual-Motor Spatial Relationships
a. Child asked to copy designs developed by White and
Phillips (1964).

IX. Body Image
a. Draw-a-Person Test
b. Child asked to identify parts of body.

X. Vocabulary
a. Child asked to define the first ten words of the Stanford-
Binet Vocabulary subtest. Child’s definitions are analyzed
by level of usage.


Procedure. The Sapir Developmental Scale (c) was individually 
administered in January of the kindergarten year and data were 
recorded on standardized forms. Designs and words were standard- 
ized on 8-in. X 11-in. oak tags, written with a black magic marker
pen. One-half inch square blocks were used for counting. The 12 
White and Phillips (1964) designs, approximately 8-in. X 8-in. 
each, were on the blackboard for the child to copy. The examiner, 
sitting beside the child, explained that there were games to play and 
asked the child to sit at a desk next to the examiner two feet away, 
facing the blackboard. 

In the fall of first grade, approximately nine months later, the 
Marianne Frostig Test of Visual Perception was administered to the
same population. One month later the New York State Reading 
Readiness Test was given. Subtests are as follows: word meaning,
listening, matching, alphabet, numbers, and copying. 

One year after the administration of the Developmental Scale,
an independent neurological survey was conducted in the public
school by a leading pediatric neurologist. The children were seen
individually in the school setting by the neurologist and they were
rated neurologically on a 1-10 scale, 5 designated as borderline.

The Stanford Achievement Test, Primary I Form W with subtests; 
word meaning, paragraph meaning, vocabulary, spelling, word 
study and arithmetic, was administered to this population in June at 
the completion of the first grade program. 

Results. Norms were established in terms of scores at the 70th 
percentile, and eighteen children placing below this percentile were 
judged with this scale to be developmentally deficient. These eigh- 
teen children scored below 61 on the scale (range 0-95) and re- 
vealed deficits in at least two of the three areas: bodily schema, 
perceptual-motor function, or language development. The range of
scores in the total first grade population was 39-84. For the "normal"
population the mean score was 69.06 and the standard deviation 8.41; for
the "deficit" population the mean score was 50.0 and the standard
deviation 9.03. For a t of 7.66, p was less than .001. Thus the
difference between means of the two groups was significant beyond 
the .001 level. 
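
The reported t can be checked from the summary statistics with the pooled-variance formula for two independent groups. The sketch below assumes that this is the test the authors used and that the "normal" group numbered 36 (the 54 children less the 18 "deficit" children); the function name is hypothetical.

```python
import math


def pooled_t(mean1: float, sd1: float, n1: int,
             mean2: float, sd2: float, n2: int) -> float:
    """Independent-groups t statistic using a pooled variance estimate."""
    pooled_var = ((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2) / (n1 + n2 - 2)
    se = math.sqrt(pooled_var * (1.0 / n1 + 1.0 / n2))
    return (mean1 - mean2) / se


if __name__ == "__main__":
    # "Normal": mean 69.06, SD 8.41, assumed N = 36
    # "Deficit": mean 50.0, SD 9.03, N = 18
    t = pooled_t(69.06, 8.41, 36, 50.0, 9.03, 18)
    print(round(t, 2))  # 7.66, with 52 degrees of freedom
```

Under these assumptions the computed value agrees with the t of 7.66 given in the text.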

Thirteen of the children in this first grade population were di-
agnosed by Dr. Arnold Gold, pediatric neurologist of the Neurologi-
cal Institute, to have “minimal cerebral dysfunction.” These
thirteen children were among the eighteen children whom the scale
designated “deficit.” None of the “normal” children was so rated
by the neurologist. Because two of the “deficit” children were ill, 
only sixteen of the eighteen were seen by the neurologist. Thus 
81.25 per cent (p<.001) of the “deficit” children seen were di- 
agnosed by Dr. Gold to have “minimal cerebral dysfunction.” 
The subtests of the Sapir Developmental Scale were able to dif- 
ferentiate children with “minimal cerebral dysfunction” from “nor- 
mal” children beyond the .001 level of significance. 

Table 1 shows a comparison of the “normal” and "deficit" popu- 
lation with the developmental scale, the New York State Reading 
Readiness Test, and the Marianne Frostig Test of Visual Percep- 
tion. The last two tests were administered eight months after the 
TABLE 1
Comparisons of Performance for 36 “Normal” and 18 “Deficit” Children

                                   “Normal”          “Deficit”
Test                             Mean     SD       Mean     SD       t      p
Sapir Developmental Scale       69.06    8.41     50.00    9.03    7.66   <.01
Sapir Perceptual-Motor          30.92    2.57     24.22    4.73    6.76   <.01
Sapir Bodily Schema             24.75    6.92     14.22    5.94    5.52   <.01
Sapir Language                  13.39    1.27     11.56    1.92    4.20   <.01
N.Y. State Readiness            77.72    7.90     62.72    7.73    6.62   <.01
Marianne Frostig Test of
  Visual Perception            107.58   12.33     94.39   12.83    3.66   <.01


children had been so diagnosed with the Sapir Developmental
Scale. This again validated the findings of the Developmental Scale.
All tests showed significant differences between populations be-
yond the .01 level.


Correlations between the Sapir Developmental Scale and the 
New York State Reading Readiness Tests and its subtests, and the 
Marianne Frostig Test of Visual Perception and its subtests, ad- 
ministered eight months after the Developmental Scale, are shown in 
Table 2. The Developmental Scale and all its subsections correlate 


TABLE 2
Correlations between Sapir Developmental Scale and New York State Readiness
and Marianne Frostig Test of Visual Perception with 54 Children

                                Total      Perceptual-Motor  Bodily Schema  Language
                             Sapir Scale     Sapir Scale      Sapir Scale  Sapir Scale
                                  r               r                r            r
N.Y. State Readiness            .66***          .66***           .58***       .44**
  Word Meaning                  .22             .24              .15          .28
  Listening                     .27*            .25              .21          .30*
  Matching                      .63***       [illegible]         .60***       .36**
  Alphabet                      .53***          .48***           .48***       .27*
  Numbers                       .03             .16              .07          .10
  Copying                       .57***       [illegible]         .50***       .26
Marianne Frostig Test of
Visual Perception               .45***          .31*          [illegible]     .14
  Eye-hand                      .49***       [illegible]      [illegible]     .16
  Figure-Ground              [illegible]        .34*             .45***       .21
  Form Constancy                .27*            .16              .30*         .04
  Position-in-space             .16             .08              .20         -.002
  Spatial Relationships         .49***          .35*             .49***       .23

*p < .05
**p < .01
***p < .001
significantly beyond the .01 level with the total New York State
Reading Readiness score (r = .66, p <.001) and its subtests:
Matching (r = .63, p <.001), Alphabet (r = .53, p <.001), and
Copying (r = .57, p <.001). These are the subtests that Gates and
Bond (1958) have listed as most significantly related to reading. 
The lack of correlation between the language section of the Develop- 
mental Scale and the New York Reading Readiness Test is very 
puzzling. Further study is required. 

Correlations of the Developmental Scale with the Marianne 
Frostig Test of Visual Perception show more variability. The total 
test scores correlate significantly (r = .45, p <.001), but subtests 
Form Constancy and Position-in-Space do not (r = .27; r = .16).

The Sapir Developmental Scale and its Perceptual-Motor section 
correlate significantly beyond the .01 level with all the subtests 
except Vocabulary of the Stanford Achievement Test administered 
at the end of first grade, seventeen months after the administration 
of the Developmental Scale. Table 3 shows that the Bodily Schema 
section of the Developmental Scale correlates with all but the Vo- 
cabulary subtest of the Stanford Achievement Test beyond the 
.02 level. The Language section of the Scale correlates signifi-
cantly beyond the .01 level with all subtests but Word Study of the
Stanford Achievement Test.

Two test-retest reliability studies were completed. The first study
was within a two week interval with eleven kindergarten children 


TABLE 3
Correlations between Scores on Sapir Scale and Stanford Achievement Test
N = 54 Children

                      Total     Perceptual-Motor  Bodily Schema   Language
Test               Sapir Scale    Sapir Scale      Sapir Scale   Sapir Scale

[Table entries illegible in the source.]

*p < .05
**p < .01
in Queens Public School No. 201, New York City and a reliability
coefficient of .92 was obtained. The second was with this popula- 
tion with a nine month interval (Sapir, 1966) and the Pearson 
test-retest correlation coefficient was .84. 

Discussion. The study has shown that with this population de- 
velopmental differences were identified at the kindergarten level, 
persisted in the first grade, and correlated significantly with the 
academic performance seventeen months later. The Developmen-
tal Scale is now being used with a larger population, and the re-
sults will be studied to see whether the scores continue to cor- 
relate with academic achievement. 

It must be emphasized that the Sapir Developmental Scale (c) 
is a survey instrument to be used with a total kindergarten popula- 
tion. With it, children can be selected for specialized training. One 
of its major assets is that it is a short test, taking at most thirty min- 
utes to administer. It can eventually be used by kindergarten teach- 
ers carefully trained. In a very general way, the instrument can high- 
light areas of deficiency in a child and can indicate the strengths or
weaknesses of sensory modalities. It is hoped that with such an
instrument children could be selected for “transition” (de Hirsch, 
Jansky, and Langford, 1965) or “maturity” (Swedish Royal 
Board of Education, 1964) classes to prevent later academic failure. 
AN EVALUATION OF THE NORMS AND
RELIABILITY OF THE PPVT FOR PRE-SCHOOL
SUBJECTS¹


W. L. BASHAW AND JERRY B. AYERS
University of Georgia 


This analysis of the Peabody Picture Vocabulary Test (PPVT)
was conducted in the experimental pre-primary unit of the Clayton
County–University of Georgia Instruction Demonstration Cen-
ter.² PPVT IQ measurements from multiple testings yielded dis-
crepancies that were observed to be large and nonsystematic. The 
usual reasons for nonreliability—such as form differences, exam- 
iner differences, individual differences in learning in the inter-test 
period, and actual instability of intelligence at this age level—did 
not appear to account for the observed IQ instability, at least in the 
opinion of the investigators. Since the usage of the PPVT is in- 
creasing rapidly, the critical study of its technical characteristics 
appears mandatory. 

Two analyses were conducted—a rational examination of the 
PPVT scoring and norming process and an empirical examination 
of the stability and equivalence of the various PPVT scores. 

The rational analysis. Reliability. Dunn’s (1965) manual cites 
eleven reliability studies conducted during 1959 to 1965. None of 


1 The research and development reported herein was performed pursuant to 
a contract with the United States Department of Health, Education, and 
Welfare, Office of Education, under the provisions of the Cooperative Re- 
search Program. This research was conducted by the Evaluation and the 
Program Development and Field Testing Divisions of the Research and 
Development Center in Educational Stimulation. The investigators wish to 
express their appreciation to Dr. Charles E. Johnson and Mr. Lacy D. Powell 
for their cooperation. 

2The Center is sponsored jointly by the Clayton County, Georgia, Board 
of Education and the Research and Development Center in Educational 
Stimulation of the University of Georgia. 
these studies involved subjects in the age range of interest (CA =
3 to 6). Dunn (1965, p. 30) does report data of his own. He
presents equivalency reliabilities of raw scores and standard errors
of IQ's. The results are in Table 1. Dunn presents reliability esti-
mates for neither MA scores nor IQ's. It is obvious that addi-
tional reliability data at these age levels are needed.
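
The standard errors in Table 1 appear to follow the usual standard error of measurement, SD × sqrt(1 − r). The sketch below checks this against the tabled reliabilities, read with their decimal points restored. The IQ standard deviation of 15 is an assumption made for the illustration; it is not stated in the text above.

```python
import math

def se_of_measurement(reliability, sd=15.0):
    """Standard error of measurement; sd=15 is an assumed IQ scale value."""
    return sd * math.sqrt(1.0 - reliability)

# (reliability, tabled SE of IQ) for several age groups in Table 1
table_1 = {
    "6-0": (0.67, 8.6),
    "5-0": (0.73, 7.8),
    "4-6": (0.72, 7.9),
    "3-6": (0.81, 6.5),
    "2-6": (0.76, 7.4),
}

for ca, (r, tabled_se) in table_1.items():
    print(f"CA {ca}: computed {se_of_measurement(r):.2f}, tabled {tabled_se}")
```

Under this assumption the computed values agree with the tabled ones to within about a tenth of a point, which is the precision one would expect from two-decimal reliabilities.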

Norms. In regard to norms, a major problem was discovered 
which the investigators believe to be the principal source of IQ 
instability. Dunn provides IQ norms based on within age-group 
standardized scores. This procedure is the technique most often
recommended; however, the use of the six-month and one-year 
age groupings leads to serious grouping errors. Other standard 
instruments, such as the Revised Stanford-Binet Intelligence Scale 
L-M and the WISC, use smaller age ranges in their norms, thus 
making this grouping error less severe. The grouping problem 
existed in Dunn’s first edition as was pointed out by Lyman (1965) 
in Buros’ Mental Measurements Yearbook.

In order to demonstrate the effect of this norm problem, a sec- 
tion of the IQ norm table (Dunn, 1965, p. 18) is reproduced as 
Table 2. Consider a child who is tested at CA = 2-8 and who
can correctly identify 12 plates. His norm IQ is 86 (see the last
row of Table 2) which is within one standard deviation of the
mean. However, if he happened to have his testing delayed one
month to CA = 2-9 and still can identify only 12 plates, his
norm IQ is 70 which is two standard deviations below the mean!
This drop of 16 points exceeds two standard errors of the IQ.
From the point-of-view of IQ stability, the child must learn to 


TABLE 1
Dunn's Reliability Estimates for Pre-primary Ssᵃ

CA       Reliability     SE of IQ      Nᵇ
6-0          .67            8.6        183
5-0          .73            7.8        133
4-6          .72            7.9        122
4-0          .77            7.2        110
3-6          .81            6.5        119
3-0          .75            7.5         92
2-6          .76            7.4         92


ᵃ Adapted from untitled table (Dunn, 1965, p. 30).
ᵇ Dunn, 1965, p. 27.
TABLE 2
Partial IQ Norm Table for Form Bᵃ

Number of                  Chronological Age
Plates Identified      2-3 to 2-8     2-9 to 3-2

       22                 103             87
       21                 101             85
       20                 100             83
       19                  98             82
       18                  96             80
       17                  94             78
       16                  93             77
       15                  91             75
       14                  89             73
       13                  87             72
       12                  86             70


ᵃ Taken from Table 4, Intelligence Quotients for Raw Scores on Form B (Dunn, 1965, p. 18).
identify 10 more plates in one month to maintain an IQ of 86 (see 
the first row of Table 2). 
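
The grouping error can be made concrete with a small lookup over the two age bands reproduced in Table 2. This is only an illustrative sketch: the function and variable names are hypothetical, and ages are converted to months so that 2-3 to 2-8 becomes 27 to 32 and 2-9 to 3-2 becomes 33 to 38.

```python
# Raw score (plates identified) -> norm IQ, copied from Table 2 (Form B).
PLATES = [22, 21, 20, 19, 18, 17, 16, 15, 14, 13, 12]
BAND_2_3_TO_2_8 = dict(zip(PLATES, [103, 101, 100, 98, 96, 94, 93, 91, 89, 87, 86]))
BAND_2_9_TO_3_2 = dict(zip(PLATES, [87, 85, 83, 82, 80, 78, 77, 75, 73, 72, 70]))

# Each band is keyed by the range of chronological ages (in months) it covers.
NORMS_FORM_B = [
    (range(27, 33), BAND_2_3_TO_2_8),
    (range(33, 39), BAND_2_9_TO_3_2),
]

def norm_iq(years, months, plates):
    """Look up the norm IQ for a chronological age and raw score."""
    ca_months = 12 * years + months
    for band, table in NORMS_FORM_B:
        if ca_months in band:
            return table[plates]
    raise ValueError("age or score outside the reproduced part of the table")

before = norm_iq(2, 8, 12)   # tested at CA = 2-8
after = norm_iq(2, 9, 12)    # same raw score one month later
print(before, after, before - after)
print(norm_iq(2, 9, 22))     # ten more plates are needed to stay near 86
```

The one-month delay moves the child across a band boundary, so an unchanged raw score of 12 maps to 86 and then to 70.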

The MA norms are based on the standard technique of linear 
interpolation between mean performance of each age group. This 
procedure does not have the sophistication of standardized scores, 
but neither does it have the serious grouping error. 

Empirical analysis. An empirical analysis was conducted for two 
purposes—to report additional data on pre-primary Ss and to test 
empirically the contention that the IQ norms affect reliability ad- 
versely. 

Procedure. Each child was tested twice—first as part of the 
student selection process during June of 1966 and second in Octo- 
ber of that year. The children had been in attendance in the pre- 
primary program for about six weeks at the time of the second 
testing. Testing was conducted by experienced examiners. The
assignment of examiners to examinees was random for the first 
testing. The classroom teachers conducted the second testing. Data 
were analyzed in three independent groups depending on which 
forms were used. Group one (N = 15) was administered Form
A twice, group two (N = 31) was administered Form B twice, 
and group three (N = 131) was first administered Form B and 
later given Form A. Analyses consisted of summary statistics and 
Pearson product-moment correlations between the two testings. 

Sample. The sample was composed of all enrollees in the ex-
perimental pre-primary unit. The original identification of pre-
primary students was based on stratified random sampling of the 
children whose parents requested that their child be enrolled. The 
sampling was stratified by race, socioeconomic status, and chrono- 
logical age. Socioeconomic status was measured by the Holling- 
shead two-factor index of social position which is based on the 
education and occupation of the head-of-household. 

The composition of the sample is presented in Table 3. The 
sample stratification was based on an attempt to get an equal sex 
distribution, a race mixture reflecting the racial balance of Clayton 
County, Georgia, and a relatively normal distribution of socio-
economic status. The sample was also stratified on the first PPVT
testing to get a normal IQ distribution in the pre-primary unit. 

Results. The summary statistics for the three sets of data appear 
in Tables 4, 5, and 6. The first two sets of data include too few 
subjects to warrant generalizing from them; however, they do 
provide some supportive evidence due to similarities to the larger 
third group. As is to be expected, the MA and raw score changes 
are larger for the group that used alternate forms. IQ changes 
were tested by t-tests. The mean change was significant only in 
the larger group (t = 6.9, df = 129). The reader is warned that 
changes in means can be partially due to bias inasmuch as the
classroom teachers examined their own students in the second 
testing. 
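
The report does not say which form of the t-test was applied. Assuming a dependent-samples test of the mean IQ change against zero, the statistic can be approximated from the summary rows of Tables 4, 5, and 6, as in the following sketch (the function name is hypothetical; the figures are the tabled IQ change means and standard deviations as printed).

```python
import math

def paired_t(mean_change, sd_change, n):
    """t for a mean change score tested against zero (dependent samples)."""
    standard_error = sd_change / math.sqrt(n)
    return mean_change / standard_error

# IQ change rows: (mean change, SD of change, N)
groups = {
    "Form A twice": (2.0, 20.8, 15),
    "Form B twice": (2.5, 18.9, 31),
    "Form B then A": (9.2, 15.4, 131),
}

for name, (mean_change, sd_change, n) in groups.items():
    print(f"{name}: t = {paired_t(mean_change, sd_change, n):.2f}")
```

For the large group this gives a value of about 6.8, close to the reported 6.9; the small gap is presumably due to rounding in the tabled figures. The two small groups give values well below 1.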


TABLE 3
Distribution of Sample Characteristics

                                   PPVT Forms Taken
Characteristics            A-then-A   B-then-B   B-then-A   Totals

Sample Size                   15         31        131        177
Sex
  Males                       10         15         68         93
  Females                      5         16         63         84
Race
  White                        6         31        127        164
  Negro                        9          0          4         13
Socio-Economic Status
  1                            1          2         16         19
  2                            2          6         30         38
  3                            4         16         49         69
  4                            3          5         28         36
  5                            5          2          8         15
TABLE 4
Summary Statistics for Group 1 (Form A, Test-Retest, N = 15)

Variable             Mean       SD

CA (months)
  Initial            54.0       9.8
  Final              57.5       9.8
  Change              3.5       1.1
Raw Score
  Initial            37.4      17.1
  Final              43.7      11.5
  Change              6.3      12.3
MA (months)
  Initial            49.9      21.7
  Final              54.5      16.5
  Change              4.7      15.9
IQ
  Initial            89.7      23.5
  Final              91.7      20.5
  Change              2.0      20.8


The summary of the reliability estimates appears in Table 7. 
The differences in coefficients are in the predicted direction. The 
norm IQ score is much less reliable than the basic raw score or its 
MA equivalent. The reliabilities of raw scores are consistent with 
those reported in Table 1. 
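
The test-retest coefficients can be cross-checked against the summary tables, because the variance of a change score equals the sum of the two variances minus twice the covariance. Solving for r gives r = (s1² + s2² − sd²) / (2·s1·s2). The sketch below applies this identity to the Group 3 rows of Table 6; the function name is hypothetical and the inputs are the rounded tabled values.

```python
def retest_r(sd_initial, sd_final, sd_change):
    """Pearson r implied by the SDs of two testings and of their difference."""
    numerator = sd_initial**2 + sd_final**2 - sd_change**2
    return numerator / (2 * sd_initial * sd_final)

# Group 3 (Form B then Form A): (SD initial, SD final, SD of change)
group_3 = {
    "raw score": (12.2, 11.6, 8.5),
    "MA": (15.6, 17.9, 12.9),
    "IQ": (14.1, 13.3, 15.4),
}

for score, sds in group_3.items():
    print(f"{score}: r = {retest_r(*sds):.2f}")
```

The implied values (about .75, .71, and .37) match the Group 3 coefficients in Table 7, which supports reading those entries with a leading decimal point.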

Discussion. Score changes in the empirical study reflect several 
sources of error including differential learning across the summer, 


TABLE 5
Summary Statistics for Group 2 (Form B, Test-Retest, N = 31)

Variable             Mean       SD

CA (months)
  Initial            46.4       9.6
  Final              50.5       9.7
  Change              4.1        .4
Raw Score
  Initial            41.0      10.5
  Final              45.2       9.7
  Change              4.2       7.8
MA (months)
  Initial            50.4      12.3
  Final              56.3      14.0
  Change              5.9      10.8
IQ
  Initial           102.4      15.1
  Final             104.9      13.0
  Change              2.5      18.9
TABLE 6
Summary Statistics for Group 3 (Form B-then-Form A, N = 131)

Variable             Mean       SD

CA (months)
  Initial            47.8      10.6
  Final              51.9      10.5
  Change              4.1        .6
Raw Score
  Initial            39.2      12.2
  Final              48.2      11.6
  Change              9.0       8.5
MA (months)
  Initial            49.0      15.6
  Final              61.5      17.9
  Change             12.5      12.9
IQ
  Initial            99.3      14.1
  Final             108.6      13.3
  Change              9.2      15.4


and during the first six weeks of the pre-primary program, ex- 
aminer differences and form differences. However, it is clear that 
the decrement in obtained IQ reliabilities relative to raw score 
reliabilities is too severe to warrant the use of Dunn’s normed IQ 
at this age level. It is recommended that PPVT users, especially 
with younger subjects, rely only on raw scores or MA equivalents. 

It is interesting to note that for the large sample, the average 
ratio IQ (MA/CA) changes from 97.6 to 118.5, a change of 21 
points. This gain reflects the IQ change more accurately, in terms 
of reliability, than the reported norm IQ change of 9.2 points. In 
the case of the PPVT, users at the pre-school level who feel com- 


TABLE 7
PPVT Reliabilities

1. Form A test-retest (N = 15)
     Raw scores      .69
     MA              .69
     IQ              .56
2. Form B test-retest (N = 31)
     Raw scores      .71
     MA              .67
     IQ              .52
3. Form B-then-Form A (N = 131)
     Raw scores      .75
     MA              .71
     IQ              .37
pelled to get an IQ score should forego Dunn's standard score IQ
in favor of the traditional MA/CA ratio. 
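
For reference, the traditional ratio IQ is 100 × MA / CA. A minimal sketch follows, applied to the Group 3 mean MA and CA from Table 6; the function name is hypothetical. The ratio of two group means need not equal the mean of the individual ratios, so it reproduces the reported averages only approximately.

```python
def ratio_iq(ma_months, ca_months):
    """Traditional ratio IQ: mental age over chronological age, times 100."""
    return 100.0 * ma_months / ca_months

# Group 3 means (months): final testing MA = 61.5, CA = 51.9
print(round(ratio_iq(61.5, 51.9), 1))

# Initial testing means: MA = 49.0, CA = 47.8
print(round(ratio_iq(49.0, 47.8), 1))
```

The final-testing figure agrees with the reported 118.5. The initial figure computed from the means differs somewhat from the reported 97.6, as expected when individual ratios are averaged rather than the means divided.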
CONSISTENT SEX DIFFERENCES IN A SPECIFIC
(DECODING) TEST PERFORMANCE¹


AARON SMITH 


Neuropsychologic Laboratory, Speech Clinic
University of Michigan 


The most recently developed and highly standardized measures
of intelligence have shown “clear cut sex differences” (Wechsler, 
1958). Wechsler’s scales for adults and children (Wechsler, 1958, 
1949) administered to large representative samples of American 
males and females from five through 64 years old showed generally 
higher scores by male adults in five subtests, and by female adults 
in three. Male adults showed most consistently higher scores than 
females in Information and Arithmetic. However, since these sub- 
tests revealed little difference between boys and girls up until the 
age of 10 years (Miele, 1954), the emergence of sex differences 
in adult performances suggests the possible influence of culture. 

Although the question of cultural vs. biologic or genetic bases 
for significant sex differences in various psychological test func- 
tions has been long debated (e.g. Terman and Tyler, 1954; Good- 
enough, 1954; McCarthy, 1955; Gallagher, 1964), the largest and 
most consistent difference was in Digit Symbol (DS). DS is also 
the only subtest showing consistent sex differences at all ages, 
with females superior to males from five to 64 years (Wechsler 
1949, 1958; Miele, 1957). 


1 Supported by PHS research grant HD-00370. Thanks are especially due to 
Mr. Don Warner, Assistant Superintendent, Mrs. Geraldine Nesvan, Super- 
visor of Psychological Service, the many teachers, principals and their pupils 
in the Omaha Public School System for their interest and cooperation in 
planning and arranging this study; to Mr. Don Moran and Mrs. Cheryl 
Shukei for the extensive testing; to Dr. John A. Miele for providing data cited 
above, and to Mr. Glenn Slagle and the Biostatistical section of The Nebraska 
Psychiatric Institute for extensive data processing. 
Although it is one of five performance (nonverbal) subtests,
Wechsler’s data show DS actually correlated higher with verbal 
than performance IQ (Wechsler, 1958, p. 255). DS is “one of the
oldest and best established of all psychological tests. . . . The 
subject is required to associate certain symbols with certain other 
symbols, and the speed and accuracy with which he does it serve 
as a measure of his intellectual ability” (Wechsler, 1958, p. 81). 

In addition to the unique mental functions involved in substi- 
tuting certain symbols for other symbols, DS is also the only 
Wechsler subtest requiring the subject to write responses. Thus, 
apart from the speed and accuracy of the specific mental opera- 
tions, DS performances involve fine visuomotor coordinations in 
perceptions of—as well as the intervening “mental operations” in- 
volved in responses to—the task. Speed and accuracy of writing 
decline with advancing age (Birren and Botwinick, 1951). It is not
surprising, therefore, that DS scores of normal adults show the 
earliest and most marked decline with advancing age when com- 
pared to the 10 remaining Wechsler subtests (Wechsler, 1958). 
The considerably smaller decrements in other Wechsler perform-
ance subtests and slight increments in verbal subtests with ad-
vancing age, and a study (Kaufman, 1966) showing markedly less
decline in older groups in an oral digit symbol test, indicate that 
poorer performances in DS reflect reduced efficiency in the fine 
visuomotor coordinations required, more than “deterioration” with 
age of the specific mental capacities involved.

Speed and accuracy of writing have also been reported to show 
girls superior to boys at all ages (Zazzo, 1948). Consistently higher 
female DS scores, therefore, might simply reflect greater speed 
and accuracy in writing, rather than in the mental operations in-
volved in substituting symbols.

The Symbol-Digit Modalities Test. The Symbol-Digit Modal- 
ities Test (SDMT) was designed in our laboratory to help differ- 
entiate effects of brain lesions on different test modalities from 
effects on the so-called “higher” mental functions involved in the 
task. The major difference between the SDMT and DS is that 
instead of substituting a geometric figure (which can only be 
written) for a number in DS, the SDMT requires that a number 
be substituted for a geometric figure (Figure 1). Thus, in the
SDMT, responses may be made to the same task (substituting 
symbols) by writing and/or by speaking numbers. Since the per-
ceptual and mental processes in SDMT performances are 
presumably identical, comparisons of oral and written SDMT re- 
sponses provide evidence of the significance of the response mo- 
dality in this assumed measure of intelligence. 

In preliminary standardization studies, the SDMT was adminis- 
tered to 3680 boys and girls from 8 through 17 years of age. All 
children were in normal classes from the third through the 12th 
grades in 15 different schools in the Omaha Public School System. 
The selection of schools was made by Mr. Don Warner, Assistant 
Superintendent of Education, and was designed to reflect the 
demographic composition of urban Omaha. Written scores for a 
90 second interval were obtained for 1090 boys and 1011 girls. 
Oral scores for a 90 second interval were obtained by the ex- 
aminer’s recording of spoken responses of 784 boys and 795 girls. 
The subject’s score consisted of the number of correct responses.
Scores of children in normal classes with obvious visual, motor, or 
apparent neurologic defects were excluded. 

The consistently higher mean written scores by girls at all ages 
(Table 1) are consistent with findings in sex comparisons of DS 
scores (Wechsler, 1958; Miele, 1957). Similarly, the increasing dif-
ferences between older boys and girls in DS in Miele’s samples
are evident in comparisons of written SDMT scores in Table 1. 
Consistently superior performances by females in substituting 
written symbols, in the SDMT from the ages of eight through 17 
years, and in DS, from the ages of five through 64 years, clearly 
indicate that American females are superior to males in written
substitution tests at all ages. Preliminary studies to establish norms 
for oral as well as written SDMT performances also permitted a 
test of the hypothesis that the higher written scores by females in 
symbol substitution tests might be due to their greater speed in 
writing. 

Oral scores of 784 other boys and 795 other girls also showed 
higher scores by females in all five age groups. Increasing superi-
ority by females with advancing age, indicated in sex comparisons
of written scores, is more marked and consistent in comparisons 
of oral scores. Both oral and written scores show relatively large 
increases by both boys and girls from 8-9 years to 14-15 years, 
and the smallest increase from 14-15 years to 16-17 years. 

Significant differences as a function of socioeconomic back- 
ground were anticipated and confirmed in comparisons of children 
attending schools in “poor” vs. “rich” neighborhoods. Although 
nutritional factors may be significant, mean scores generally corre- 
lated positively with socioeconomic status. Sex comparison groups 
were equated for socioeconomic status. The markedly smaller in- 
crease in mean oral scores of boys than girls from 14-15 to 16-17 
years may be due to undefined and uncontrolled sampling varia- 
bility. Another possible explanation for this small increase is that 
the maximum average ability of males (which is significantly less 
than that of females in these samples) may be reached at an 
earlier age. Nonetheless, although 14-15 year old girls were gen- 
erally one full grade lower than 16-17 year old boys, the girls' 
written and oral mean scores were both higher than those of the 
older boys. 

Although data for children in special classes were excluded, and 
an effort was made to obtain a representative cross-section of nor-
mal Omaha public school children, the population is obviously
selected, i.e., children in an urban mid-western public school sys-
tem. The demographic character of the midwest with a primarily
indigenous native-born population differs considerably from that 
of large American coastal cities with higher proportions of immi- 
grants and families in which two languages are used at home. The 
extent to which these SDMT findings can be generalized to other 
more or less similar populations is therefore unknown. 

As noted above, however, the consistently higher oral and writ- 
ten SDMT scores by females in comparisons of 1874 boys and 
1806 girls (8-17 years old) in Omaha public schools are in agree- 
ment with patterns of consistent female superiority in written DS 
scores of 1100 boys and 1100 girls (5-15 years old) and 850 males 
and 850 females (16-64 years old) in samples carefully selected to 
reflect the demographic composition of the American population
as a whole. 
It might be argued that consistently higher symbol substitution
test scores by females in oral as well as written responses may
simply reflect greater speed in speaking as well as in writing,
rather than in the mental operations involved (since girls have 
been reported to acquire speech at an earlier age than boys); or 
that higher oral and written scores by females may reflect their 
superior oculo-motor coordination involved in both forms of test- 
ing; or that females have greater capacity for concentration and 
are less distractible.

However, though no causal relationships can be adduced, certain 
other biological differences distinguishing the human sexes have 
been reliably reported. Although the average newborn male brain 
is slightly larger and heavier than that of the female, females show 
earlier subsequent biological development reflected in comparisons 
of bone age and in the advent of puberty approximately one and 
one-half years before males. 

The effects of gonadotrophic hormones on total brain functions
are unknown. However, the earlier appearance of puberty in fe- 
males reflects earlier maturation and development of hormone- 
secreting functions of the anterior pituitary gland at the base of 
the brain. It is interesting to note, therefore, that the most marked 
sex differences in SDMT oral and written scores occur shortly 
after the age of puberty and following the sudden and sharp in-
crease in production of estrogens by females.

If superior substitution test performances by females simply 
reflected greater speed and efficiency in writing, speaking or 
oculo-motor coordination, how can we explain the differences be- 
tween males and females in the underlying cerebral mechanisms 
involved? On the other hand, it may also be that for whatever 
unsurmisable biological and/or cultural factors, females are also 
simply superior to males in the specific mental functions involved
in this single aspect of test intelligence. 
SCALE PROPERTIES OF THE INTEREST INDEX


GERALD HALPERN¹
Educational Testing Service 


A full history of the Interest Index² is available (Halpern, 1966),
which traces its early development for the Progressive Education 
Association (Smith and Tyler, 1942) through a series of studies 
carried out by French (1964). In its present form, this test consists 
of 192 L-I-D items (16 items for each of 12 scales). Because 
previous research was restricted to half-length scales, the present 
study was undertaken to (1) examine the internal consistency of 
each scale, (2) ensure that each item provided differential assess- 
ment, (3) examine interscale correlations, and (4) compare the 
average scale scores for various subgroups of subjects. 

Procedure. The anticipated client population being senior high 
school students, the subjects for this study were grade 11 students. 
In each of five school systems, high schools were selected where 
50 per cent of the graduates went to college and where the enroll- 
ment was approximately equal in numbers of boys and girls. Test 
administration was carried out by teachers or guidance counselors 
in intact homeroom classes. Retesting in the same homerooms took 
place within 19 to 23 days of the first testing. Of the 2,024 students 
present at the first testing, 1,859 (92%) were also present at the 
second testing. A loss of this size may be attributed to normal 
absence from class on any given school day. 

Results and discussion. All analyses were performed separately 
for four subgroupings formed by (a) sex and (b) intent to go to 
college. Complete results were presented, separately for each 


1Now at the Collegiate Institute Board, Ottawa, Ontario, Canada. 
2 The name of this instrument has since been changed to Academic Interest
Measures (AIM), a title more descriptive of its objectives.
group (Halpern, 1965). Scale internal consistencies were computed
using Kuder-Richardson formula 21 and ranged from .86 to .95 
with a median of .91. Test-retest scale correlations, three week 
interval, were .77 to .92 with a median of .86. These coefficients 
were found to compare favorably with analogous coefficients re- 
ported for similar tests. 
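
For readers unfamiliar with it, Kuder-Richardson formula 21 estimates internal consistency from only the number of items k, the mean M, and the variance s² of total scores: r = [k / (k − 1)] × [1 − M(k − M) / (k·s²)]. The sketch below is a generic illustration for dichotomously scored items. How the formula was adapted to the three-choice L-I-D items is not described here, so the example values are hypothetical and are not taken from the study.

```python
def kr21(n_items, mean_total, sd_total):
    """Kuder-Richardson formula 21 for a test of dichotomously scored items."""
    variance = sd_total ** 2
    item_term = mean_total * (n_items - mean_total) / (n_items * variance)
    return (n_items / (n_items - 1)) * (1.0 - item_term)

# Hypothetical example: 20 items, mean total score 12, SD 4
print(round(kr21(20, 12.0, 4.0), 3))
```

Because it assumes items of equal difficulty, KR-21 is generally treated as a conservative estimate relative to formula 20.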

Differential assessment. The high internal consistency coeffi- 
cients suggested that the items comprising any one scale repre- 
sented a common interest area. To test whether the items also 
provided differential assessment, the relationship of each item to 
all of the scales was examined. If an item was primarily related to 
the interest area to which it had been assigned, then that item 
should have had at least as high a correlation with the scale for 
that interest area as it had with any of the other interest area scales. 

It should be noted that the goal of differential assessment of 
interests, taken alone, would require that each item enjoy a sig- 
nificantly higher correlation with its assigned scale than it has with 
any other scale. This more stringent criterion, however, conflicts 
with the need of an interest-assessment instrument to sample from 
all the activities associated with an interest area (Kuder, 1954; 
Nunnally, 1959, p. 59) and with the natural relationships between 
certain interest areas. To the extent that two related interest areas 
manifest themselves in similar activities, they cannot be differ- 
entially measured (Wesman and Bennett, 1951). To force differ- 
entiation of correlated interests would introduce bias in the activ- 
ities sampled through a disproportionate selection of activities 
peripheral to the interests being measured.

The correlations between each individual item and each of the 
12 scales (corrected for overlap) were rank ordered, and the rank 
of the scale to which the item had been assigned was determined. 
Tied ranks were used where the difference between two correla-
tions was not statistically significant at the .01 level.
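
The ranking step can be sketched as follows. The procedure described above decides ties with a significance test; this illustration substitutes a fixed margin (`tie_margin`) as a simple stand-in for that test. The function name and all correlation values are hypothetical.

```python
def classify_item(corrected_r, assigned_scale, tie_margin=0.05):
    """Compare an item's correlation with its own scale to its best rival."""
    own = corrected_r[assigned_scale]
    best_other = max(
        r for scale, r in corrected_r.items() if scale != assigned_scale
    )
    if own - best_other > tie_margin:
        return "higher with own scale"
    if best_other - own > tie_margin:
        return "higher with another scale"
    return "tied"

# Hypothetical item-scale correlations, already corrected for overlap
item = {"Executive": 0.48, "Secretarial": 0.46, "Music": 0.10}
print(classify_item(item, "Executive"))
```

An item labelled "tied" corresponds to the case in which it correlates as high with some other scale as with its own; only the "higher with another scale" outcome fails the criterion for differential assessment.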

Of the 768 items examined (192 repeated in each of four sub- 
groups), 708 (92%) correlated higher with the scale to which 
they had been assigned than with any other scale. Among the
remaining items, 56 (7%) correlated as high with some scale (e.g., 
Secretarial) other than their own as they did with their own scale 
(e.g., Executive). Examination of these items showed that, in all
cases, they were correlated with scales meaningfully related to 
their own scales. The remaining four items (1%) correlated higher
with a scale (e.g., Engineering) other than the one to which they 
had been assigned (e.g., Physical Sciences). Additional research is 
in process to replace these four items which were rejected by the 
criterion for differential assessment. 

Interscale correlations. It was pointed out above that where 
natural relationships exist between certain interest areas, one can- 
not expect their measurement to be orthogonal. The interscale 
correlations obtained in this study were categorized as being 
“high” or “low” on the basis of the amount of variance they 
shared. Scales having 25 per cent or more of their variance in 
common (r ≥ .50) were classified as high. Scales sharing less than
25 per cent of their variance were low. Of the 264 different corre- 
lations (66 for each of the four groups), only 21 (8%) were .50 
or greater. Where the scales were highly correlated, inspection of 
the scale labels revealed appropriate relationships. That is, interest 
areas that are expressed by similar activities were correlated. 
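
The counts in this paragraph follow from simple arithmetic: 12 scales yield 12 × 11 / 2 = 66 distinct pairs, four subgroups give 264 correlations, and a shared variance of 25 per cent corresponds to r² = .25. A short sketch of that bookkeeping (names are illustrative only):

```python
import math

N_SCALES = 12
N_GROUPS = 4

pairs_per_group = math.comb(N_SCALES, 2)      # distinct scale pairs
total_correlations = pairs_per_group * N_GROUPS

def is_high(r, shared_variance_cutoff=0.25):
    """A correlation is 'high' when the two scales share >= 25% variance."""
    return r * r >= shared_variance_cutoff

print(pairs_per_group, total_correlations)
print(round(100 * 21 / total_correlations))   # per cent classified as high
print(is_high(0.50), is_high(0.49))
```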

Average scale scores. In addition to the above indices, average 
scale scores and their standard deviations were computed. These 
are presented in Table 1 and serve as a first approximation to 
norms. Inspection of this table reveals that the average scale scores 
obtained are in line with the usual expectations of interests within 
and between these groups. Females were higher than males on the 
English, Fine Arts, Secretarial, Foreign Languages, Music, and 
Home Economics scales. Males were higher on the Mathematics,
Physical Sciences, and Engineering scales. Educational level aspira-
tions were related to several scales, with the students who planned
to go to college generally having relatively high interest scores. 
The groups planning to go to college expressed higher interests 
in Biology, English, Mathematics, Social Sciences, Physical Sci- 
ences, Foreign Languages and Music. The students who did not 
plan to go to college had relatively higher interest scores on the 
Secretarial, Home Economics, and Executive scales. In terms of 
the commonly held stereotypes of these groups, the interest pat-
terns offer no surprises.

These average scores contribute to the construct validity of the
scales in that they agree with “postulated attribute(s) . . . assumed
to be reflected in test performance” (Cronbach and Meehl, 1955). 
That is to say, the writer's expectations about interest differences 
TABLE 1
Means and Standard Deviations for the Interest Index Scales

                        Male,          Male,          Female,        Female,
                        College        Noncollege     College        Noncollege
                        N = 577        N = 362        N = 593        N = 499

Scale                   M      SD      M      SD      M      SD      M      SD
Biology                18.6    9.0    16.8    9.0    18.0    8.7    14.9    8.5
English                15.4    8.2    10.5    7.8    20.6    7.7    14.8    7.8
Fine Arts              14.0    8.1    15.8    8.6    20.4    8.4    18.3    8.3
Mathematics            19.5    8.8    15.2    8.8    14.4    8.6    10.8    8.6
Social Sciences        19.8    9.0    14.9    9.2    19.4    9.0    14.5    9.0
Secretarial            17.0    6.8    16.7    7.1    19.7    7.7    24.8    6.6
Physical Sciences      21.0    8.4    19.1    8.9    12.5    8.5     8.8    8.0
Foreign Languages      15.4   10.0    11.6    9.7    22.0    8.6    16.0   10.0
Music                  14.7    9.2    13.1    8.6    18.3    8.4    14.5    8.5
Engineering            22.0    8.2    23.7    7.6    10.5    8.0     9.5    8.2
Home Economics         12.8    7.1    13.4    7.5    24.5    6.7    27.3    5.5
Executive              19.4    7.4    17.2    8.0    16.9    6.9    20.0    7.3


between sex and college plans are in agreement with the measured 
interests of these groups. The claim is not made at this time that 
the validity of this test has been "proven" by such differences. One 
aspect of validity has been partially confirmed, however, and this 
does provide encouragement for further validity studies. 

Summary and conclusions. An interest inventory was adminis- 
tered to 1,859 grade 11 students on two occasions with a three- 
week interval. Scale scores were analyzed separately for each of 
four subgroups classified by both sex and college plans. Individual 
scale internal consistencies were found to cluster about .91. With 
the exception of four items, every item correlated at least as high 
with the scale to which it was assigned as with any other scale. 
Test-retest reliabilities for the individual scales clustered about
.86. Intercorrelation of the scales showed a remarkable degree of
scale independence. One aspect of the construct validity of the
test was demonstrated when the interest levels of the groups, rela- 
tive to one another, were shown to agree with the expected in- 
terest profiles for the groups. 
CROSS VALIDITIES OF SIX SVIB DISCRIMINANT
EQUATIONS FOR FEMALE STUDENTS IN
HEALTH AND EDUCATION¹


GEORGE H. DUNTEMAN²
University of Florida 


An earlier investigation (Dunteman, 1966) demonstrated that 29 
Scales of the Strong Vocational Interest Blank for Women 
(SVIB-W) could successfully discriminate among five curricula 
groups of female college students in health and education. The 
level of statistical significance as well as the level of practical sig- 
nificance were quite encouraging. One 29 scale analysis and two 
11 scale analyses yielded multivariate test statistics that were all
significant well beyond the .001 level. Classifying the original 200 
students on the basis of classification equations (likelihood func- 
tions) developed in the discriminant function space resulted in an 
average of 53 per cent correct classifications for the three analyses. 
The purpose of the present investigation was twofold: first, to cross-
validate the classification equations on a new sample; second, to com-
pare the relative effectiveness of a number of classification schemes
on a cross validation sample.

Since the time of the earlier investigation, the SVIB profiles 
were obtained on 135 additional female students who are
currently enrolled in Medical Technology (MT), Occupational
Therapy (OT), Physical Therapy (PT), Nursing (N), and Edu-
cation (E) at the junior or senior level. The original analysis was


1 This research was supported in part by a research grant, #RD-1127, from 
the Vocational Rehabilitation Administration, Department of Health, Edu-
cation, and Welfare, Washington, D.C. Thanks are due to John P. Bailey, Jr.,
Jeffrey Kanov, and the University of Florida Computing Center for the
statistical analyses.

2 Now at the Educational Testing Service, Princeton, New Jersey.
based upon 200 female students in MT, OT, PT, N, and E who,
with the exception of the Ns, were either graduates of one of these
particular programs or currently enrolled as either juniors
or seniors. The Ns were nursing students who had successfully
completed their sophomore year. Originally, it was planned
to cross validate the classification equations derived on the sample
with which the earlier investigation was concerned. However, it 
was felt that a more meaningful and reasonable two stage analysis 
(validation and cross validation) would be to combine all the re- 
cently collected data for currently enrolled students (juniors and 
seniors) with the original sample and then select the graduates 
from the total sample to be used for the development of the 
multivariate classification equations. The sample on which the
equations would be cross validated would be the remainder of the
total sample, that is, those students who were presently
enrolled as either juniors or seniors in one of the five curricula.
This two-stage analysis seemed most reasonable because of
two considerations: (1) the groups for the validation sample on 
which the classification equations were to be based would be more 
homogeneous (all graduates) than the earlier sample (graduates 
plus currently enrolled students) and consequently more mean- 
ingful and effective classification equations might be developed 
for cross validation purposes; (2) the cross-validation sample 
would be composed of the more recent students and hence the 
equations developed on the graduate sample would be utilized to 
classify future observations (students). The validity would thus be
predictive rather than concurrent and would correspond closely to a
"real-life" classification scheme.

Method. The new validation sample was composed of 191 graduate
MTs (39), OTs (48), PTs (25), Ns (46), and Es (33), and
the cross validation sample was composed of 144 currently enrolled
MTs (12), OTs (33), PTs (14), Ns (52), and Es (33). The
classification equations were to be developed both on the basis of
all 29 variables³ and using only those variables that yielded
univariate F's that were significant at the .001 level. The purpose
of varying the number of variables was to determine the effects of
deleting variables on classification efficiency in the cross-validation
sample. Concurrently, the effect of the degree of “fit”
of the classification equations (likelihood functions) to the validation
sample on the classification efficiency for the cross
validation sample was to be examined by using classification
equations varying in the amount and kind of information (i.e.,
degrees of fit) used to develop them.


3 The scores on the engineer, sister [illegible], and physical therapist scales
were not used because of missing data.

One classification equation used was Cronbach and Gleser's $D^2$
(Cronbach and Gleser, 1953), which is simply the sum of the squares
of deviation scores from the test means for a particular group. In
matrix formulation this is $z'Iz$, where $I$ is an identity matrix
and $z$ is a vector of deviation scores ($X_{ij} - \bar{X}_{ig}$ on each test $i$) for person $j$
and group $g$. A $D^2$ is computed for each person and each group,
and the person is subsequently classified into that group which
yields the lowest $D^2$. $D^2$ indicates the (squared) distance of the person's test
score in the test space from the group mean. This statistic does not
consider the intercorrelations among the tests and also leaves the
variables in their original scale, so that variables with the largest
variances make the largest contributions to the $D^2$. The $I$ matrix
with its off-diagonal zeros symbolizes the assumption of uncorrelated
variables. The matrix equation $z'Iz$ could be written $z'z$.
This technique was the simplest of the classification equations used
and accordingly took into consideration the least amount of
information.
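
The $D^2$ rule just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the two groups, and the three-scale mean profiles are hypothetical and are not taken from the study.

```python
# Sketch of Cronbach and Gleser's D^2 rule: z'Iz, the sum of squared
# deviations of a person's scores from each group's mean profile.
# The group means and scores below are invented for illustration.

def d_squared(scores, group_means):
    """Return z'Iz = z'z for one person and one group."""
    return sum((x - m) ** 2 for x, m in zip(scores, group_means))


def classify_by_d_squared(scores, means_by_group):
    """Assign the person to the group with the lowest D^2."""
    distances = {g: d_squared(scores, m) for g, m in means_by_group.items()}
    best = min(distances, key=distances.get)
    return best, distances


# Hypothetical mean profiles on three scales for two groups.
means_by_group = {"MT": [40.0, 30.0, 25.0], "OT": [30.0, 45.0, 35.0]}
person = [38.0, 33.0, 27.0]

group, distances = classify_by_d_squared(person, means_by_group)
```

Because no variances or covariances enter the computation, a scale with a large spread of raw scores dominates the sum, which is the limitation noted above.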

Another technique considered was $\chi^2 = z'D^{-1}z$, where $D^{-1}$ is
the inverse of the pooled within-group dispersion matrix for the
five groups. The inclusion of $D^{-1}$ in the classification equation
instead of $I$ indicates that information concerning the covariances
and variances of the variables was utilized. This $\chi^2$ is proportional
to a likelihood function when equal dispersion matrices are assumed.
The last and most complex technique utilized was $p(x) = z'D_g^{-1}z + \log_e |D_g|$,
where $D_g^{-1}$ is the inverse of the within-group dispersion
matrix for group $g$ and $|D_g|$ is the determinant of the within-group
dispersion matrix for group $g$. It was computed on the basis of a
modification of Cooley and Lohnes’ (1962) classification program
(see Dunteman, 1966). The $p(x)$ is proportional to a likelihood function
that assumes that the within-group dispersion matrices are
unequal among groups. This classification equation is the most complex
and utilizes the most information. It not only considers the covariances
and variances of the variables for all five groups, but also
allows them to differ among the groups. For one 29 variable
analysis, the reduced discriminant space of four dimensions was
utilized instead of the test space because of computational considerations.
If the within-group dispersion matrices are fairly similar,
then approximately the same level of classification efficiency is
obtained as if the original test space had been used (see Cooley
and Lohnes, 1962). However, in general, the reduced space model
is somewhat less efficient than the test space model. The equation
was $z_{dg}'D_{dg}^{-1}z_{dg}$, where $z_{dg}$ was a vector of deviation scores
from the group means on the four discriminant functions and $D_{dg}^{-1}$
is the inverse of the within-group discriminant space dispersion matrix for group $g$.
For more details on these classification schemes, the reader is
referred to Dunteman (submitted).
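
The difference between the pooled and the separate dispersion indices may be easier to see in a small numerical sketch. The Python below is restricted to two variables so that the inverse and determinant can be written out directly; the helper names and all numbers are hypothetical, and it is not the Cooley and Lohnes program.

```python
import math

# Sketch of the two dispersion-based indices for two variables only,
# so that the 2 x 2 inverse and determinant can be written out by hand.
# All matrices and scores are invented for illustration.

def det2(d):
    """Determinant of a 2 x 2 matrix given as nested lists."""
    return d[0][0] * d[1][1] - d[0][1] * d[1][0]


def inv2(d):
    """Inverse of a 2 x 2 matrix."""
    det = det2(d)
    return [[d[1][1] / det, -d[0][1] / det],
            [-d[1][0] / det, d[0][0] / det]]


def quad_form(z, d_inv):
    """Quadratic form z' D^{-1} z for a length-2 deviation vector z."""
    return sum(z[i] * d_inv[i][j] * z[j] for i in range(2) for j in range(2))


def pooled_index(scores, mean, pooled_disp):
    """chi^2 = z' D^{-1} z with one pooled within-group dispersion matrix."""
    z = [x - m for x, m in zip(scores, mean)]
    return quad_form(z, inv2(pooled_disp))


def separate_index(scores, mean, group_disp):
    """p(x) = z' D_g^{-1} z + log_e |D_g| with group g's own dispersion."""
    z = [x - m for x, m in zip(scores, mean)]
    return quad_form(z, inv2(group_disp)) + math.log(det2(group_disp))


# With an identity dispersion matrix the pooled index reduces to D^2.
identity = [[1.0, 0.0], [0.0, 1.0]]
chi2_identity = pooled_index([3.0, 4.0], [0.0, 0.0], identity)

# Unequal variances: each squared deviation is divided by its variance.
disp = [[4.0, 0.0], [0.0, 16.0]]
chi2 = pooled_index([2.0, 4.0], [0.0, 0.0], disp)
p_x = separate_index([2.0, 4.0], [0.0, 0.0], disp)
```

In practice each person would receive one index per group, computed from that group's mean vector (and, for the separate index, that group's own dispersion matrix).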

It should be noted that for each of these classification equations,
a person's test score is substituted into five equations (one for each
group) and she is predicted to belong to that group which yields
the most favorable index (the smallest distance, which corresponds to the highest likelihood).

Results. A discriminant analysis of the validation sample yielded
a Wilks’ lambda (see Cooley and Lohnes, 1962) that was significant
beyond the .001 level, indicating that the 29
scales were significantly discriminating among the five groups.
Nine out of the 29 scales yielded univariate F's significant at the
.001 level. These nine scales were the Occupational Therapy
(OT), Laboratory Technician (LT), Physician (MD), Social
Worker (SW), Dentist (D), Math-Science Teacher (MST), Music
Performer (MP), Music Teacher (MT), and Femininity-Masculinity
(FM) scales. Consequently, one set of analyses involved
developing classification equations for all 29 variables, and the 
other set of analyses involved developing equations for the nine 
variables previously listed. All classification equations for the nine 
variable analyses were based upon the test space, while one 29 
variable analysis was based upon the reduced discriminant function 
space. In most practical situations, the differences are rather small 
and the reduced space model results in about the same classification 
efficiency as the test space model. 

The configuration of the centroids of the five groups in the
discriminant function space of the present analysis was highly
similar to the configuration of the five groups in the earlier
investigation. This seems reasonable, since there was a large overlap
between the students used in the two studies. Table 1 shows the
percentage of students correctly classified by using both nine and
29 variables under various classification schemes.

Discussion. From Table 1, it can be seen that utilizing 29 variables
gives a higher average of correct classification than does using
only nine variables. However, one of the better
techniques was using the pooled within-group dispersion matrix for
nine variables and classifying in the test space. Notice that the
techniques using the greatest amount of information (i.e., separate
within-group dispersions) and the least amount of information (i.e.,
Cronbach and Gleser's $D^2$) fared the worst under cross validation.
This would seem to indicate that an overfit or underfit of the validation
sample data might yield an equation that would shrink more
under conditions of cross validation than a moderate fit such as
$z'D^{-1}z$. It is known that $z'D_g^{-1}z$ is less efficient than $z'D^{-1}z$ if
the population dispersion matrices are identical.

It is also interesting to note that the reduced space equation for 
29 variables is about as efficient as one of the test space models for 
nine variables. The reason for not classifying in the 29 variable 
test space was that the dispersion matrix had elements that were 
too large to invert using the available subroutine. The dispersion
matrix could have been scaled to a correlation matrix and then
inverted; instead, an approximate inverse of the pooled within-group
dispersion matrix, probably quite close to the actual inverse,
was utilized, and it yielded 51 per cent correct
classifications.

It should be noted that the manner in which the nine variables
were selected was not optimum in the sense that another group of
nine variables might have yielded higher classification efficiency.
It is encouraging, however, that one equation using nine variables
selected in this easy manner held up extremely well upon cross
validation. All six classification equations performed better than
chance, and some performed considerably better than others.


TABLE 1

Percentage of Students Correctly Classified for Combinations of
Variables and Classification Schemes

                                      Number of Variables
Technique                                 9        29
$z'Iz$                                   34%      42%
$z'D^{-1}z$                              46%      51%ᵃ
$z'D_g^{-1}z$                            32%       -
$z'D_g^{-1}z + \log_e |D_g|$             37%       -
$z_{dg}'D_{dg}^{-1}z_{dg}$                -       47%

ᵃ Based upon an approximate inversion of D.
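
What counts as "chance" with five groups of unequal size depends on the baseline adopted, and the article does not state which one was intended. The following Python sketch, offered only as an illustration, computes three common baselines from the cross-validation group sizes given in the Method section and compares them with the percentages in Table 1; the variable names are hypothetical.

```python
# Cross-validation group sizes as reported in the Method section.
cross_validation_sizes = {"MT": 12, "OT": 33, "PT": 14, "N": 52, "E": 33}

total = sum(cross_validation_sizes.values())
proportions = {g: n / total for g, n in cross_validation_sizes.items()}

# Three common definitions of "chance" (the choice among them is an
# assumption here; the article does not say which one was intended).
equal_chance = 1 / len(cross_validation_sizes)
proportional_chance = sum(p ** 2 for p in proportions.values())
largest_group_chance = max(proportions.values())

# Cross-validation hit rates from Table 1, as proportions.
hit_rates = [0.34, 0.42, 0.46, 0.51, 0.32, 0.37, 0.47]
beats_equal = all(rate > equal_chance for rate in hit_rates)
beats_proportional = all(rate > proportional_chance for rate in hit_rates)
beats_largest = all(rate > largest_group_chance for rate in hit_rates)
```

Under the equal-priors baseline (20 per cent) and the proportional baseline (about 25 per cent) every rate in Table 1 exceeds chance. Under the stricter rule of assigning everyone to the largest group (52 of 144, about 36 per cent), the 32 and 34 per cent rates would not.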

The differences in classification efficiency (shrinkage) for each
of the six equations between the validation and cross validation samples
ranged from 2 per cent (29 variable Cronbach and Gleser $D^2$) to
19 per cent (9 variable $z'D_g^{-1}z$). The least or moderate fitting
equations had the least shrinkage. The shrinkages for the most complex
equations were large enough that, even though they
classified more efficiently on the validation sample (as would be
expected), they underperformed the simpler techniques on the cross
validation sample.

In the earlier investigation 54 per cent of the validation sample
was correctly classified, while in the present investigation
58 per cent of the validation sample was correctly classified
using the same equation, $z_{dg}'D_{dg}^{-1}z_{dg}$, as in the earlier investigation.
Increasing the homogeneity, then, by including only graduates in 
the validation sample might have slightly increased the predictive 
power of the new equations. 

Three out of the four nine variable equations were less effective 
than the 29 variable equations. However, as mentioned previously, 
these nine variables were not necessarily optimal. The simplest 29 
variable equation, Cronbach and Gleser’s $D^2$, does better than three
out of the four nine variable equations. The interpretation of the
results concerning the relative effects of number of variables and
degree of fit must be regarded as speculative since a number of
important dimensions in regard to classification efficiency were
not varied (e.g., sample size) and other dimensions were not varied 
enough (e.g., number of variables). More important, these results 
were based upon only one sample, and as such there was no control 
for sampling fluctuation. The cross validities for some of these 
SVIB equations for groups as intrinsically similar as the groups 
used in the present study were quite encouraging. 

Summary. The present investigation concerned a cross validation
of a discriminant analysis of an earlier investigation (Dunteman,
1966). Six different classification equations (likelihood functions)
derived from various scales of the Strong Vocational Interest
Blank for Women (SVIB-W) were cross validated on a more
recent sample of 135 female students currently enrolled in Medi- 
cal Technology, Occupational Therapy, Physical Therapy, Nurs- 
ing, and Education at either the junior or senior levels. All 
six equations yielded a higher percentage of correct classifications 
on a cross validation sample than would be expected by chance 
and some equations performed much better than others. Indica- 
tions are that the most effective equations on cross validation were 
those that were developed on the basis of a “moderate” fit to the 
validation data. 
VALIDITY OF SCORING METHODS
FOR BIPOLAR SCALES 


LEONARD V. GORDON 
State University of New York at Albany 


Likert-type attitude scales are widely used for psychological
measurement. These scales traditionally provide a set of response
categories about some neutral point. For example, where seven 
response categories are used, the respondent specifies strong, mod- 
erate or slight agreement, neutrality, or slight, moderate, or strong 
disagreement for each item. The respective alternatives are usually 
weighted +3, +2, +1, 0, —1, —2 and —3. The conventional scoring 
of a Likert scale involves the summing of the weights for the 
responses made by the individual to the complete set of items. 

The individual’s responses comprise two psychologically
distinguishable components. The first is the direction of his re- 
sponse, that is, whether he agrees or disagrees with the statements. 
The second is the extremeness of his response, whether he tends to 
select extreme alternatives in agreeing or disagreeing. Cronbach 
(1950) has argued that the selection of extreme responses may 
largely represent response-set variance, and that since this type of 
variance tends to be non-valid, Likert scales might better be scored 
dichotomously. Peabody (1962) examined the direction and ex- 
tremeness components separately in order to assess their relative 
contribution to the total scores as conventionally determined. He 
concluded that the total scores “. . . reflect primarily the direction 
of response” (p. 73). 

Korn and Giddan (1964) performed a similar analysis on the 
Dogmatism scale (Rokeach, 1960) and found a positive relation- 
ship between strength of agreement and strength of disagreement 
revealing the existence of an extremeness component. They also 
found very similar corresponding correlations for the conventional
score and the dichotomous score with selected aptitude and personality
measures. They concluded that since the extremeness
component does not significantly affect relationships with out- 
side criteria, the simpler dichotomous scoring should be used. 

The method of analysis used by the above investigators suc- 
cessfully identified a response set associated with extreme re- 
sponses, and indicated that these extreme responses added little of 
value to the total score as conventionally obtained. However, their 
findings do not preclude the possibility of this “response set” 
making a positive contribution to the validity of an instrument 
when some other method of keying is employed. 

There is good reason to believe on logical grounds that the 
so-called extremeness response set will contribute valid variance 
to Likert-type attitude scales. Individuals who are high on the 
construct being measured by the scale will endorse more items at 
the positively scored pole than at the negatively scored pole. For 
these people, an extremeness response set will result in higher net 
scores, since the loss at the negative pole will be more than compensated
for by the gain at the positive pole. Similarly, individuals
who are low on the construct will endorse more items at the 
negatively scored pole, and the tendency to mark extreme items 
will result in a lower score than they would otherwise obtain. 
Finally, for individuals who are intermediate on the construct, 
extremeness responses at both poles should counterbalance one 
another, leaving their scores relatively unchanged. 

However, the contribution made by the extremeness response 
sets would not necessarily be expected to be equivalent at the 
negative and positive poles, since it has been found that response 
sets tend to operate differentially at the two poles (Gordon, 1952; 
1958). Thus, it would be important to make an empirical determination,
for the scale in question, of the extent to which extreme
responses, in effect, do differentially discriminate at each pole.
Where such discrimination can be reliably established, alternatives 
can be weighted so as to capitalize on this differentiation. 

The present study was performed to compare the validity of
a scoring method, based on an empirical analysis of the differential
discriminating power of extreme alternatives, with the
method recommended by Peabody (1962) and by Korn and Giddan
(1964) and with the traditional Likert scoring method. Relationships
among these scores and extremeness scores also are to be
examined.

Procedure. The instrument used in the present investigation is 
the Work Environment Preference Schedule—WEPS (Gordon, 
1966), which is composed of 30 statements designed to reflect 
attitudes and behaviors that are fostered and rewarded in most 
large or highly structured organizations (Weber, 1946; Merton, 
1949). Typical items are “In dealing with others, rules and regula- 
tions should be followed exactly," and “There is really no place in 
a small organizational unit for the nonconformist.” The respondent 
uses a five-category response mode, specifying whether he 
strongly agrees, agrees, is undecided, disagrees or strongly dis- 
agrees with each statement. 

The WEPS may be scored conventionally by assigning weights 
of +2, +1, 0, —1, and —2, respectively, to these alternatives. The 
score obtained from this weighting procedure has been referred to
as the Composite (C) score. A second score, Proportion Agree
(P), consists simply of the proportion of “Strongly Agree" and 
"Agree" responses. This P score was recommended by Peabody 
(1962) and by Korn and Giddan (1964) as an operational sub- 
stitute for the C score. The above investigators also developed two 
extremeness scores, one representing strength of agreement (X) 
and the other strength of disagreement (Y). The X score is ob- 
tained by weighting responses of "Strongly Agree" by 2 and 
“Agree” by 1, and obtaining the arithmetic mean. The Y score is 
similarly obtained with a weighting of 2 for “Strongly Disagree” 
and 1 for “Disagree” responses. 

The fifth score, unique to the present study, was devised to 
capitalize on any differential discrimination provided by extreme 
responses at either or both poles. WEPS data were obtained from 
samples representing varied populations. Internal item analyses 
then were performed in order to determine the consistency with 
which each response alternative differentiated between the upper 
and lower groups from sample to sample.² It was found that in


1The nomenclature and abbreviations used by the other investigators are 
adopted here to facilitate comparisons. 

2Seven samples were used in the item analysis process. These consisted of 
male and female civil service employees, guidance counselors and Peace Corps 
volunteers, and male community college students. The use of varied samples 
most instances the “Strongly Agree” and “Agree” alternatives
were similar to one another in their discrimination between high
and low groups. At this pole, the extremeness response set did not
contribute uniquely to within-group differentiation. On the other
hand, the “Disagree” and “Strongly Disagree” alternatives typically
did differ from one another in their discriminations between
high and low groups, the “Strongly Disagree” alternative providing
a more pronounced differentiation. Interestingly, “Undecided”
and “Disagree” were found to be quite similar to one another
in their internal statistics. On the basis of the above analysis,
“Strongly Agree” and “Agree” were weighted 2, “Undecided” and
“Disagree” 1, and “Strongly Disagree” 0. This scoring scheme will
be referred to as Empirical (E).
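
The five scores defined above can be illustrated with a short Python sketch. The response codes, function names, and the six-item protocol are hypothetical. Taking the mean for X and Y over only the responses at the relevant pole is an assumption, since the text does not spell out the denominator.

```python
# Response codes: SA, A, U, D, SD (five-category WEPS response mode).
CONVENTIONAL = {"SA": 2, "A": 1, "U": 0, "D": -1, "SD": -2}   # C score
EMPIRICAL = {"SA": 2, "A": 2, "U": 1, "D": 1, "SD": 0}        # E score


def composite(responses):
    """C: sum of the conventional weights."""
    return sum(CONVENTIONAL[r] for r in responses)


def proportion_agree(responses):
    """P: proportion of 'Strongly Agree' and 'Agree' responses."""
    return sum(r in ("SA", "A") for r in responses) / len(responses)


def extremeness(responses, strong, mild):
    """X or Y: mean weight (2 = strong, 1 = mild) at one pole.

    Averaging over only the responses at that pole is an assumption;
    returns None when the respondent gave no response at the pole.
    """
    weights = [2 if r == strong else 1 for r in responses if r in (strong, mild)]
    return sum(weights) / len(weights) if weights else None


def empirical(responses):
    """E: sum of the empirically derived weights."""
    return sum(EMPIRICAL[r] for r in responses)


# A made-up six-item protocol.
answers = ["SA", "A", "A", "U", "D", "SD"]
scores = {
    "C": composite(answers),
    "P": proportion_agree(answers),
    "X": extremeness(answers, "SA", "A"),
    "Y": extremeness(answers, "SD", "D"),
    "E": empirical(answers),
}
```

Note that the E key separates only “Strongly Disagree” from “Disagree”, so extremeness at the agreement pole leaves the E score unchanged.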

The validities of the several scoring schemes were next assessed
against criteria of practical significance to which the WEPS would
be expected to be related. Two samples, which had not been used
in key development, were employed in the validation study.

The WEPS is a measure of the extent to which an individual 
endorses statements reflecting attitudes, behaviors and procedures 
fostered and rewarded in highly structured, authoritarian type or- 
ganizations. Thus, scores on the WEPS would be expected to 
predict an individual's motivation toward remaining in the military 
service or in a military training program. Data from two studies, 
each of which employed a military motivation criterion, were used 
for the present analysis. In the first study, the WEPS had been
administered, under non-threat conditions, to an entire senior class 
of 472 cadets at the U. S. Military Academy, one week prior to 
graduation (Bridges, 1967). Statements of intention to remain in 
the service after the cadets’ obligated tour of service were then 
obtained. The statements, coded into five categories to represent 
degrees of intent, served as the criterion. In the second study, the 
WEPS had been administered to an entire first year Air Force 
ROTC class of 172 students at Boston College (Bronzo, 1966). 
Resignation versus remaining in the program at the start of the 
second year was the criterion used in this study. The resignation 
rate for this particular sample was 45 per cent. Validities, in the 


for item analysis purposes tends to yield highly stable weights (see Wherry,


form of product moment correlation coefficients with the continuous
Military Academy criterion and biserial correlation coefficients
with the dichotomous ROTC criterion, were obtained for
all five scores. For each sample, product-moment correlations were
also obtained among all pairs of scores. As a matter of interest,
intercorrelations among these five WEPS scores were also computed
for the most highly contrasting group used in the original
item analysis, the 38 male and 20 female Peace Corps Volunteers.
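
For readers unfamiliar with the two validity coefficients named above, the Python sketch below computes a product-moment correlation and a biserial correlation. It uses the standard conversion from the point-biserial coefficient and assumes a normally distributed variable underlying the dichotomy; the function names and the eight data points are invented and are not the study's data.

```python
import math
from statistics import NormalDist


def pearson_r(x, y):
    """Product-moment correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def biserial_r(scores, stayed):
    """Biserial correlation of a continuous score with a 0/1 criterion.

    Uses r_b = r_pb * sqrt(p * q) / y, where r_pb is the point-biserial
    (Pearson) correlation, p and q are the two group proportions, and y
    is the standard normal ordinate at the point cutting off proportion p.
    """
    p = sum(stayed) / len(stayed)
    q = 1.0 - p
    std_normal = NormalDist()
    ordinate = std_normal.pdf(std_normal.inv_cdf(p))
    return pearson_r(scores, stayed) * math.sqrt(p * q) / ordinate


# Invented scores for eight trainees; 1 = remained, 0 = resigned.
weps = [12, 15, 9, 20, 18, 7, 14, 16]
remained = [0, 1, 0, 1, 1, 0, 1, 0]
r_pb = pearson_r(weps, remained)
r_b = biserial_r(weps, remained)
```

With an even split, as in this toy example, the biserial coefficient is the point-biserial multiplied by the square root of pi/2, roughly 1.25.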

Results. It will be noted in Table 1 that in both studies the 
Empirical key (E) based on an analysis of the discriminating 
power of extremeness responses has the numerically highest va- 
lidity. Differences between the E and P scores, considered for the 
two studies together, are significant beyond the five per cent level. 
While differences between the E and C scores do not achieve 
conventional levels of significance, in both studies the E score has 
the higher validity, and in the absence of counterindicative infor- 
mation would be the score of choice. As had been found in the 
previous research, the validities of the C and P scores are quite 
similar in magnitude, the C score being slightly larger numerically. 

Significant validities are found for the Y score but not the X 
score in both samples. Thus, more valid discriminations are ob- 
tained between the two alternatives reflecting disagreement, than 
between those representing agreement. The greater differentiation 
between the disagreement alternatives reflects what had been 
found in the earlier internal item analysis. 

Table 2 presents intercorrelations among the scores for the 
ROTC, U. S. Military Academy, and Peace Corps samples. These 
interrelationships are extremely similar to counterpart correlations 


TABLE 1

Validities of Five Types of WEPS Scores Against Military Career Motivation
Criteria for ROTC (N = 172) and U. S. Military Academy (N = 472)
Samples

                              SAMPLE
SCORE                      ROTC     USMA
E  Empirical               .44*     .41*
P  Proportion Agree        .35*     .37*
C  Composite               .36*     .39*
X  Extremeness-Agree       .15      .05
Y  Extremeness-Disagree   -.20*    -.20*

* Significant at the .01 level.
found by Peabody and by Korn and Giddan, and also are highly
uniform from group to group. 

Positive correlations between the X and Y scores also had been 
found in the earlier studies and had been cited as evidence for the 
existence of an extremeness response set, since extreme responses 
in one direction based on content would be logically expected to 
be associated with less extreme responses in the opposite direction. 

The very high correlations between the C and P scores had 
been noted in both previous studies and had served as one basis for 
the recommended substitution of the P score for the C score. The 
larger correlations of the C and P scores with the Y scores than
with the X scores, previously noted, again suggest that the dis- 
criminations between the negative response alternatives were more 
reliable than those between the positive alternatives. 

E scores, unique to the present study, are logically most closely
related to the C scores, both sharing the more complex system of
weighting of response alternatives. E and C scores are also fairly
highly related to the Y scores, all three types of scores utilizing a 
differential weighting of the two disagreement alternatives. 

Discussion. The results of previous and present research agree
very closely. In both, positive correlations between
the X and Y scores provide evidence for the operation of an extremeness
response set; P scores and C scores are substantially
correlated and have similar correlations with external variables,
suggesting that P scores may be substitutable for C scores. However,
the present research goes beyond the earlier studies in examining
the validity of a new approach to scoring Likert scales.

It has been popularly held that response sets, per se, are “bad” 
and that their effects should be eliminated or somehow counter- 
acted. However, a rationale supporting the position that an ex- 
tremeness response set might well function to increase the validity 
of Likert scales was presented, and empirically tested. On the basis 
of the obtained findings, it can be concluded that it is possible to
develop scoring weights which will capitalize on extremeness variance
and which will yield, in cross validation, validities higher than
those resulting from procedures devised to minimize such variance.

Reasonable confidence may be held in the outcome of the pres- 
ent research, since data from varied populations were used in the 
development and cross-validation of the weights. Also, for the 
validation study, criteria were selected which would clearly be
expected to be related in a meaningful and direct manner to the
construct measured by the Likert scale, with tenuous or complexly
mediated relationships avoided. Thus, the present study is
believed to represent a fair test of the relative validity of the
several scoring methods. Studies of more popularly used instruments
(e.g., the Dogmatism scale) employing this type of scaling
would be well worth undertaking in order to test further the
generality of the present results.
SUBCULTURAL VARIATIONS IN THE VALIDITY
OF THE CALIFORNIA F-SCALE¹


STEPHEN M. SALES
The University of Michigan
AND
NED A. ROSEN
Cornell University


In principle, the process of construct validation (Cronbach and
Meehl, 1955) is not overly complicated. A researcher begins with 
some “construct,” or postulated attribute of the individuals to be 
measured. He then generates a variety of hypotheses, each based 
on the interpretation he makes of the construct in question. Finally, 
he builds some instrument designed to measure this construct and 
employs this instrument in investigations which focus upon each 
of the previously developed hypotheses. If the various hypotheses 
are supported, both the instrument and the underlying interpreta- 
tion of the construct are said to have been “construct validated.” 
If the investigations produce negative results, either the instrument
or the theory (or, in some cases, both) becomes suspect.
Behavioral (and physical) scientists of a variety of persuasions 
have long been engaged in this procedure. 

One of the instruments which has undergone such a process of 
construct validation, and which has apparently been accepted as 
valid, is the authoritarianism scale developed by Adorno and his 
associates during the late 1940's (Adorno, Frenkel-Brunswik, Lev- 
inson, and Sanford, 1950). It is beyond question that there were 


1The authors are indebted to Walter Nord and Allan Schwartzbaum for 
their assistance during the research phases of the investigation. The Research 
and Publications Division of the New York State School of Industrial and 
Labor Relations provided financial support. 
severe methodological difficulties in the study from which the
F-Scale arose and that important methodological problems continue
to plague the instrument. Furthermore, it is clear that many
of the propositions within the theory developed by Adorno et al.
either have not been investigated or have been disconfirmed. Nevertheless,
with these shortcomings taken into consideration, the
F-Scale still generally seems to predict those covariances and, in
some cases, those behaviors which (according to the theory) it
should predict. In their extensive review article, Christie and Cook
(1958) have accepted the construct validity of the F-Scale (p.
189); Brown (1965, p. 524) has recently reached the same conclusion.

However, it is of some significance that the construct validity
of the F-Scale has been established primarily on data drawn from 
urban, middle-class Americans. Subjects from this group com- 
prised the sample on which the F-Scale (and its precursors) were 
built; they have appeared in great numbers in succeeding studies 
which have drawn from college subject pools. It is the thesis of the 
present investigation that, the vast number of F-Scale studies which 
have yielded significant and expected findings notwithstanding, the 
validity of this instrument has not been sufficiently established for 
populations other than the urban, middle-class group. It is possible, 
of course, that the measure is indeed valid for other populations 
but that researchers have merely failed to sample these groups 
during the course of their investigations. However, it will be the 
writers’ position that the F-Scale is not so valid for other popula- 
tions as it has been shown to be for the urban, middle-class group. 
In support of this general point, data will be presented which
suggest that, although the measure predicts quite well to a theoretically
relevant behavior for subjects born and reared in urban
environments, it completely fails to do so for subjects born and
reared in rural environments.


Method.² Subjects and experimental setting. The present study
was conducted in the upholstering department of a furniture manufacturing
plant located in a Northeastern town of moderate size.


The factory as a whole employs approximately 450 persons; the 


———————— 


?The experimental methods 
i ОЁ the present study correspond exactly tO 
those reported in Rosen and Sales (1966). AE (DAE described in 


some detail in that report, they will not be belabored here. 
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entire upholstering department (72 white males) served as Ss in 
this investigation. These Ss upholster living room furniture on & 
straight line basis, each line being composed of eight to ten men. 
An individual financial system is applied throughout the depart- 
ment in which the Ss work. 

Of the 72 Ss, thirty reported (in a questionnaire measure) that 
they had been born and reared in a rural environment ("farm"), 
whereas forty-two reported that they had been born and reared in 
an urban environment ("small town, large town, or city"). The 
entire sample was split into two subsamples on the basis of this 
difference in early background. The two subsamples did not differ 
in terms of age, education, degree of union activity, or mean score 
on the California F-Scale. 

Experimental design. The present study was conducted during 
a period of four sequential work weeks. Productivity data from 
the first two weeks of this period established a base rate against 
which similar data from the final two weeks could be compared. 

The independent variable in this investigation, as in the study 
reported by Rosen and Sales (1966), was merely the presence or 
absence of behavioral research operations in the department. Three 
days before the onset of the experimental period, each of the 72 
Ss received a letter on Cornell University letterhead informing 
him that research would be conducted in the plant and that both 
the plant management and the union had agreed to this research. 
On the following Monday, two graduate students entered the de- 
partment and began “research” operations. These students ob- 
served and interviewed the workers, took copious notes, and in 
general made themselves in evidence. They did not, however, make 
any other significant changes in the environment of the workers. 
The researchers remained in the department for two full work 
weeks; during the ninth day of this period they administered a 
lengthy questionnaire to the Ss. 

The dependent variable for this study was change in worker 
productivity following the onset of the experimental period. As 
noted above, each subject worked under an individual incentive 
system. This system (MTM) rates each worker's individual 
weekly productivity as a percentage of his base rate. The range of 
such ratings, for the 72 workers involved, was approximately from 
80 per cent to 200 per cent; as shown by prior research on these 
Ss, this measure is a highly stable one. In order to arrive at the 
change scores employed in this investigation, each worker's per- 
centage of base for the two-week control period was subtracted 
from his percentage of base for the two-week experimental pe- 
riod. These change scores ranged from a gain of eighteen per cent 
to a loss of thirty-eight per cent. 
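
The subtraction described above can be sketched in a few lines. The worker labels and percentages below are hypothetical and serve only to illustrate the arithmetic; they are not data from the study.

```python
def change_scores(control_pct, experimental_pct):
    """Return experimental-period percentage of base minus
    control-period percentage of base, worker by worker."""
    return {w: experimental_pct[w] - control_pct[w] for w in control_pct}


# Hypothetical two-week ratings (per cent of base rate) for three workers.
control = {"A": 120.0, "B": 95.0, "C": 160.0}
experimental = {"A": 138.0, "B": 90.0, "C": 122.0}

scores = change_scores(control, experimental)
for worker, delta in scores.items():
    print(worker, delta)
```

A positive score indicates a gain in productivity during the experimental period; a negative score indicates a loss.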

The questionnaire which was administered by the researchers 
during the ninth day of the experimental period contained, among 
other items, questions concerning various aspects of the workers’ 
backgrounds and current behavior. It also included a thirteen item 
version of the California F-Scale. This abbreviated F-Scale had 
been created by selecting those thirteen items from form 40 of the 
original scale which showed the greatest discriminative power for 
male samples; it had been extensively used in research performed 
at Cornell University (e.g., Rosen and Sales, 1966; Sales, 1964; 
Gruenfeld and Weissenberg, 1966). 

Hypotheses. It is suggested that a large increase in productivity 
during the period in which the researchers were present
constitutes a form of authority-dependent behavior, whereas a large
decrease in productivity during the same period constitutes a
counter-dependent behavior pattern. Two sets of data support 
these interpretations.

In the first place, it seemed clear from the spontaneous
comments of the Ss to the researchers (reviewed in Nord, 1963) that 
the two graduate students were perceived by the workers as in 
some way connected with the management of the factory. The 
researchers were consistently asked whether they were “management
men,” and the workers appeared quite interested in discovering
what the company was “getting” from the research. Because
of the perceived connection between the researchers and the plant
[...] productivity increase, then, would [...] response to
authority; a productivity decrease would [...] response.

[...] groups. In particular, workers who were inactive in the union or 
who received high F-Scale scores tended to increase their
productivity during the experimental period, whereas workers who 
were active in the union or who received low F-Scale scores 
tended to decrease their productivity during this period. (As re- 
ported by Rosen and Sales, these findings were statistically inde- 
pendent of each other.) These findings are what one would expect 
if an increase in productivity during the experimental period were 
a form of dependent behavior and if a decrease in productivity 
were a form of counter-dependent behavior. 

If increases in productivity actually do bear the hypothesized 
relationship to authoritarian trends within the personalities of the 
Ss, then these increases represent one meaningful criterion against 
which the F-Scale may be construct validated for various
subgroups within the sample. It is hypothesized that, inasmuch as the 
F-Scale has been shown to be valid for urban, middle-class sam- 
ples, it should show a significant positive correlation with produc- 
tivity increases for those Ss who were born and reared in urban 
environments. However, inasmuch as no clear evidence exists con- 
cerning the construct validity of this measure for Ss with rural 
backgrounds, it is not necessarily expected that a similar relation- 
ship should obtain for Ss from this group. Finally, it is predicted 
that the validity coefficient for the F-Scale within the urban sub- 
sample should be substantially larger than that for the rural 
subsample.

Results. It was found that the correlation between the F-Scale 
and productivity increases within the urban subsample was +.39, 
p < .005. (Inasmuch as the direction of all correlations was
predicted in advance, one-tailed tests will be employed throughout.) 
The parallel correlation for the rural subsample was found to be 
only +.04, non-significant. These correlations differed significantly 
from each other (p = .05) following the z transformation. There 
was no suggestion of curvilinearity in the scattergram of either of 
these correlations.
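
The comparison of two independent correlations by the z transformation can be sketched as follows. This is the usual textbook form (Fisher's z with standard error based on n - 3 in each group), applied here to the subsample sizes given earlier (42 urban, 30 rural); the authors do not describe their computation in detail, so the figure obtained need not coincide exactly with the p they report.

```python
import math


def fisher_z_difference(r1, n1, r2, n2):
    """Test r1 > r2 for two independent samples using Fisher's z.

    Returns the normal deviate and its one-tailed probability.
    """
    z1, z2 = math.atanh(r1), math.atanh(r2)          # z transformation
    se = math.sqrt(1.0 / (n1 - 3) + 1.0 / (n2 - 3))  # SE of the difference
    z = (z1 - z2) / se
    p_one_tailed = 0.5 * math.erfc(z / math.sqrt(2.0))  # upper normal tail
    return z, p_one_tailed


# Urban subsample: r = +.39, n = 42; rural subsample: r = +.04, n = 30.
z, p = fisher_z_difference(0.39, 42, 0.04, 30)
print(round(z, 2), round(p, 3))
```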

In order to determine whether the difference in validity co- 
efficients between the two subsamples could be due to range re- 
striction in the rural subsample, group variances in authoritari- 
anism scores and in productivity increases were compared by 
means of F tests. Although the urban subsample did show
somewhat greater variance than the rural subsample on the F-Scale, 
this difference was not statistically significant. There was no differ- 
ence between the groups in variance on the criterion measure. 

A large mean difference on one variable between the two sub- 
samples, given either curvilinearity or lack of homoscedasticity in 
the overall correlation between the F-Scale and the criterion be- 
havior, could also lead to the observed differences in correlation 
coefficients without having any implications for the relative valid- 
ity of the F-Scale in the groups studied. There was no significant 
difference between the two groups in mean F-Scale score; how- 
ever, as reported by Rosen and Sales (1966), the mean productivity 
increase was significantly greater for the rural Ss. Therefore, the 
overall correlation between productivity increases and authoritari- 
anism was checked for both curvilinearity and nonhomoscedastic- 
ity. Neither of these conditions was found to obtain. Thus it 
appears that the difference in means on productivity change could 
not have accounted for the observed difference in validity
coefficients between the two subsamples.

Finally, as noted above, the two subsamples did not differ on 
educational level, age, or level of union activity. These variables 
are thus apparently removed as possible sources of the observed 
difference between the two coefficients of construct validity. 

Discussion. As may be seen, the F-Scale does seem to have 
substantial validity for the urban subsample employed in the pres- 
ent investigation. This finding is interesting but hardly crucial, 
inasmuch as the construct validity of this instrument for this popu- 
lation seems by now to have been firmly established (e.g., Brown, 
1965; Christie and Cook, 1958). However, there is no evidence 
whatever in these data for similar construct validity of the F-Scale 
within the rural subsample. Furthermore, it seems clear that, 
whatever the actual validity of the F-Scale within a rural popula- 
tion, the measure is less effective as a predictor of behavior for Ss 
within this group than it is for Ss within an urban group. 
There are [...] distinct ways of explaining the observed [...]
in this study [...]. In the first place, it is quite possible that
the theory which relates authoritarianism as a construct to the
[...] is correct for all Ss, but that the F-Scale items
used in this investigation fail to measure to a high degree this 
construct for a rural population. “Corrective measures,” in this
case, would consist of creating a separate authoritarianism
instrument to be used only for rural subjects, a tactic already in use for 
the measurement of authoritarianism in Jewish Ss (Adelson, 1953) 
and in “authoritarians of the left” (Rokeach, 1960). On the other 
hand, it is possible that the theory of authoritarianism is correct 
only for individuals from urban environments. In this case, a
scientist interested in predicting authoritarian behavior in rural Ss 
would have to begin by specifying a new theory of “rural authori- 
tarianism.” In general, researchers have been predisposed to con- 
sider problems of the sort documented in this study as due to 
limitations in instrumentation rather than to inadequacies in 
theorizing; however, neither their data nor those presented herein 
can definitely point toward this alternative as the correct one. 

It should not be assumed, of course, that these data contradict 
the position of Adorno et al. (1950); indeed, these authors were 
quick to point out that their research might be applicable only to 
individuals in the groups from which they drew their samples. 
However, these data do strongly argue against the assumption that, 
inasmuch as the F-Scale has been shown to be valid for one popu- 
lation, it therefore will be equally valid for all other populations, 
or even for all subgroups within a given setting. If one wishes to 
predict to authoritarian behavior within a population other than 
middle-class urban Americans, the present investigation suggests 
that one would be unwise to rely automatically upon the California 
F-Scale. 


THE MEASUREMENT OF EMPIRICALLY 
DETERMINED VALUES 


LEON GORLOW AND GARY A. NOLL 
The Pennsylvania State University 


This report is a summary of the development of a new
instrument for the measurement of values in a college population. The 
approach can be used as a model for the development of instru- 
ments for measuring values in other populations, viz., adolescents, 
the disadvantaged, the clergy, or individuals in later maturity. The 
basic position here is that the measurement of values needs to 
consider those values which are empirically derived for any given 
population; the Allport-Vernon-Lindzey Study of Values is prob- 
ably not a suitable description of the value system of most indi- 
viduals to whom it has been administered. 

Previous work. Another report (Gorlow and Noll, In press) 
described an empirical method for deriving values in a college 
population. Briefly, that paper reported the generating of 1500 
value statements by members of the population of interest (here, 
college students); the review and condensation of that list resulted 
in 75 distinct, unambiguous value statements. This final set was 
used as a Q-sort to which 105 persons responded along the dimen- 
sion “of highest value to me” to “of lowest value to me.” A prin- 
cipal components inverse factor analysis identified eight hypothet- 
ical types of people whose valuing attributes were specified by 
examining factor loading-item placement correlations. The types 
were subsequently named Affiliative-Romantics, Status-Security 
Valuers, Intellectual Humanists, Family Valuers, Rugged
Individualists, Undemanding Passives, Boy Scouts, and Don Juans.
Demographic data about subjects related meaningfully to the value
orientations, and this was taken as evidence for construct validity. 

Procedure. Inasmuch as it is desirable to have an instrument for 
studying individual differences in values and since a Q-sort does not 
readily serve this purpose, the data of the earlier study were used to 
define the items for a new instrument which would permit scoring 
subjects along each of the dimensions which were identified. The 
items (value statements) were for the most part selected on the 
basis of high factor loading-item placement correlations. (Since 
each person has eight factor loadings and 75 item-placement scores 
from his Q-sort, it is possible to correlate factor loadings with 
item-placements.) 

A test format was adopted in which subjects would be required 
to rank value statements within three-item sets. Each set consti- 
tuted a triplet which is three value phrases, each from a different 
orientation. Since there were eight possible value orientations, 56 
triplets were generated, which is all of the possible combinations 
of eight dimensions taken three at a time (each dimension being 
represented by one statement within a three item set). Using trip- 
lets seemed more economical of time and effort than the ranking 
within pairs which constitutes the format, for example, of the 
Edwards Personal Preference Schedule, since one can arrive at a 
larger range of possible scores for a given number of sets. (As an 
aside, let one also note that inasmuch as value statements constitute 
valued goals or ideals, there is relatively little theoretical expecta- 
tion of the influence of social desirability in responding to test 
items.) 

Consideration of the test format led to the view that it would 
be wise to avoid duplicating pairs of similar items more than once 
in all of the triplets. For example, the writers wished to avoid 
pairing the item “to be loved” of the Affiliative-Romantic scale 
with the item “to feel you own what you want” of the Status-Security
scale more than once, although the Affiliative-Romantic 
scale would be paired with the Status-Security scale a total of six 
times. This could be effected by the device of using seven items 
for each value orientation. After all possible triplets were formed, 
both the presentation of value phrases within triplets and the order- 
ing of the triplets themselves were randomized. This then pro- 
vided a value instrument consisting of 56 triplets presented in 
forced-choice format and potentially providing measurement along 
eight value dimensions. It should also be noted that each value 
orientation appeared 21 times in the 56 triplets. 

By the simple procedure of assigning a weight of plus one to the 
highest item in a triplet, zero to the middle item, and minus one to 
the lowest item, scores within a population varying from minus 
21 to plus 21 could be generated for each scale, since in the 56 
triplets each value orientation is represented 21 times. 
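
The counts given above (56 triplets, 21 appearances per orientation, six pairings of any two orientations) and the scoring range of minus 21 to plus 21 can be checked with a short sketch. The single-letter labels and the hypothetical respondent who always ranks in alphabetical order are illustrative assumptions only; the actual instrument uses value phrases and randomized presentation.

```python
from collections import Counter
from itertools import combinations

# Eight value orientations, labeled A-H for illustration only.
orientations = list("ABCDEFGH")

# Every combination of eight orientations taken three at a time.
triplets = list(combinations(orientations, 3))

# How often each orientation, and each pair of orientations, occurs.
appearances = Counter(o for t in triplets for o in t)
pair_counts = Counter(frozenset(p) for t in triplets for p in combinations(t, 2))


def score_responses(rankings):
    """Score ranked triplets: +1 for the highest-ranked orientation,
    0 for the middle one, and -1 for the lowest."""
    totals = {}
    for highest, middle, lowest in rankings:
        totals[highest] = totals.get(highest, 0) + 1
        totals.setdefault(middle, 0)
        totals[lowest] = totals.get(lowest, 0) - 1
    return totals


# Hypothetical respondent who always ranks in alphabetical order, so that
# "A" is always highest and "H" is always lowest.
totals = score_responses(triplets)
print(len(triplets), appearances["A"], totals["A"], totals["H"])
```

Because every triplet contributes one plus and one minus, a respondent's eight scale scores always sum to zero, which is a property of any forced-choice format of this kind.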

For each value orientation, seven value phrases were identified 
which were solely representative of that value orientation. The 
bulk of these items were obtained from an examination of the 
table of factor-item placement correlations in the earlier study. 
For example, the items “to be loved,” “to be in love,” “to be
sensitive to others’ feelings,” “to be able to respond emotionally,” 
and “to make others happy” constituted five of the seven
statements finally utilized in the Affiliative-Romantic set. Sometimes, in 
order to have seven items per group, it was necessary to augment 
a group of items with new ones not used in the original study. In 
the case of the Affiliative-Romantic orientation, the items “to be 
warm-hearted” and “to be affectionate” were added. 


Examples. Several examples of triplets follow:¹
To be relatively free of emotional attachments.
To be free of religious constraints. 
To have little political concern. 

To have self control. 
To be involved in family life. 
To know you are the best at something.

To be physically attractive. 
To relax and feel content. 
To be part of a family group. 

To seek the truth. 
To have little political concern. 
To be sensitive to others’ feelings. 

To have physically active recreations. 
To know you are the best at something. 
To be free of entangling relationships. 


¹The 56 triplets constituting the complete instrument are available from 
the authors on request. 

Preliminary standardization. In order to gain some impression of 
the characteristics of the instrument, it was administered to 77 
introductory psychology students at the Pennsylvania State Uni- 
versity. It was observed that the approximate modal time for 
responding to the instrument was 20 minutes with virtually all 
individuals completing the task in 40 minutes. Satisfactory
variability on all scales of the instrument was observed. Scores ranged 
from -8 to +19 on the Affiliative-Romantic scale, from -16 to 
+9 on the Status-Security Valuers scale, from -13 to +9 on the 
Intellectual Humanist scale, from -18 to +18 on the Family 
Valuers scale, from -9 to +18 on the Rugged Individualist scale, 
from -12 to +16 on the Undemanding Passive scale, from -14 to 
+8 on the Boy Scout scale, and from -18 to +10 on the Don Juan 
scale. 

Frequency distributions were constructed and chi-square goodness
of fit was computed for each scale in order to determine 
whether there were significant deviations from normality. All of 
the chi-squares except one were not significant, indicating no
significant departures from the normal distribution for seven scales. 
The single exception was in the Family Valuer scale, where a
chi-square of 19.14 was observed, which, with seven degrees of freedom,
is significant at the .01 level. On this scale, most individuals 
piled up at the high end of the scale; that is, most people in the 
sample endorsed this orientation. Marked skewness was produced 
by the fact that a few people strongly rejected this orientation. It 
is possible that the age of the subjects was a significant factor in 
producing this result. It is also possible that the size of the sample 
was not sufficiently large to give a true picture of how college 
students might respond to this scale. Normative data for a larger 
sample of students (N = 700) are now being developed, and 
additional efforts directed toward reliability study and construct 
validation are underway. 
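
The significance level quoted for the Family Valuer scale can be checked with a short sketch. The function below uses the closed-form upper-tail probability of chi-square for odd degrees of freedom; the function name is an illustrative choice, and the computation says nothing about how the authors grouped scores into class intervals, which the report does not describe.

```python
import math


def chi2_upper_tail_odd_df(x, df):
    """Upper-tail probability of a chi-square value for odd df,
    using the closed-form expansion in terms of the normal curve."""
    if df < 1 or df % 2 != 1:
        raise ValueError("this closed form applies to odd df only")
    chi = math.sqrt(x)
    normal_tails = math.erfc(chi / math.sqrt(2.0))  # 2 * (1 - Phi(chi))
    ordinate = math.exp(-x / 2.0) / math.sqrt(2.0 * math.pi)
    series, denominator = 0.0, 1.0
    for r in range(1, (df - 1) // 2 + 1):
        denominator *= 2 * r - 1            # 1, 1*3, 1*3*5, ...
        series += chi ** (2 * r - 1) / denominator
    return normal_tails + 2.0 * ordinate * series


p_family = chi2_upper_tail_odd_df(19.14, 7)
print(round(p_family, 4))
```

The probability obtained falls below .01, in agreement with the statement in the text.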
SOCIAL DESIRABILITY AND THE TAYLOR 
MANIFEST ANXIETY SCALE, THE GENERAL 
ANXIETY AND TEST ANXIETY SCALES 


RICHARD M. SUINN 
University of Hawaii 


This was a study of social desirability and response set effects on 
the Taylor Manifest Anxiety Scale (MAS) (Taylor, 1953), the 
General Anxiety Scale (GA) (Sarason, 1957), and the Test Anx- 
iety Scale (TA) (Sarason, 1961). Such influences on these meas- 
ures require study, particularly since the test items concern highly 
personal aspects of the Ss. 

Method. Ss were 39 college students who answered the tests 
personally and then rated the items on a nine point social de- 
sirability rating scale (Edwards, 1964). Biserial r’s between per- 
sonal answers (T or F) and ratings, and percentage of Ss answer- 
ing each question “False” were then calculated. The biserial r’s 
are judged to show the influence of social desirability on each item. 
The obtained percentages were inferred to indicate the inability 
of the items to measure anxiety because the operation of a response 
set leads to a high number of Ss answering in the same direction. 
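
The two item statistics described above can be sketched as follows. The sketch assumes that, for a single item, each S contributes one True/False answer and one desirability rating, and it uses the standard biserial formula; the data are hypothetical and the function names are illustrative, not taken from the study.

```python
from statistics import NormalDist, mean, pstdev


def biserial_r(answers, ratings):
    """Biserial r between a dichotomous answer (True/False) and a rating."""
    true_ratings = [r for a, r in zip(answers, ratings) if a]
    false_ratings = [r for a, r in zip(answers, ratings) if not a]
    p = len(true_ratings) / len(answers)   # proportion answering True
    q = 1.0 - p
    # Ordinate of the unit normal curve at the point dividing p from q.
    ordinate = NormalDist().pdf(NormalDist().inv_cdf(p))
    mean_gap = mean(true_ratings) - mean(false_ratings)
    return mean_gap / pstdev(ratings) * (p * q / ordinate)


def percent_false(answers):
    """Percentage of Ss answering the item False."""
    return 100.0 * sum(1 for a in answers if not a) / len(answers)


# Hypothetical answers and nine-point ratings of one item by six Ss.
answers = [True, True, True, False, False, False]
ratings = [6, 7, 5, 4, 6, 5]

r_b = biserial_r(answers, ratings)
pct = percent_false(answers)
print(round(r_b, 2), pct)
```

The biserial coefficient is undefined when every S gives the same answer, so items answered uniformly would have to be handled separately.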

Results. Significant biserial r’s were found for three MAS, one 
GA, and one TA items. Biserial r's approaching significance were 
found for three other MAS, two other GA, and no other TA 
items. Over 95 per cent of the Ss answered five MAS items “False” 
as compared to no GA and one TA item. The results on the GA 
and TA scales could have been obtained by chance. However, the 
incidence of significant data on the MAS suggests that the removal 
of certain items could eliminate social desirability or response set 
influences: items 14, 23, 24 were inferred to be high on social 
desirability influences; items 2, 10, 14, 21, 23 were judged to be 
non-discriminating since 95 per cent of the Ss gave the same
answer. Replication of this study is desirable to confirm the
generalizability of the data to other populations of examinees. 


THE SUGGESTED EFFECTS OF EXPERIENCE WITH A 
COLLEGE ENTRANCE EXAMINATION ON 
MEASURABLE AFFECTS OF ANXIETY, 
DEPRESSION, AND HOSTILITY 


ROBERT R. KNAPP 
Educational and Industrial Testing Service 
WAYNE S. ZIMMERMAN 
DOUGLAS L. ROSCOE 
California State College at Los Angeles 
AND 
WILLIAM B. MICHAEL 
University of Southern California 


It probably would not be too difficult to find individuals who 
having been subjected to a series of examinations would testify to 
changes in affect during the testing session. It is more difficult, 
however, to quantify and objectively to score such changes. The 
standard forms of temperament and personality inventories are not 
sufficiently sensitive to pick up mood fluctuations. 

The Multiple Affect Adjective Check List (MAACL) was de- 
signed as an instrument which would be sensitive to changes as 
evidenced by feelings of anxiety, depression, and hostility (Zuck- 
erman and Lubin, 1965). During the early stages of the develop- 
ment of the MAACL, Zuckerman administered the anxiety scale 
on consecutive college class meetings one week apart. It was shown 
that measured anxiety increased significantly just prior to an
examination. Zuckerman and Biase (1962) and Zuckerman, Lubin, 
Vogel, and Valerius (1964) demonstrated comparable changes in 
affect in similar studies. 

Despite limitations in the experimental design which did not 
include a control group receiving only a pre- and posttest of the 
MAACL with innocuous intervening experience, the study
reported here was intended to furnish tentative and suggestive
evidence of changes in affect during a single aptitude and
achievement testing period of substantial duration (five hours). It was 
hypothesized that in terms of scores on three scales of the MAACL 
students would record significantly greater feelings of anxiety, and 
especially of depression and hostility, at the end of a long and 
tedious (five hour) testing period than they registered at the
beginning of that period. 

Method. The subjects were 306 entering freshmen at a large 
liberal arts college in southern California. The testing session ran 
from 8:00 o'clock in the morning to 1:00 o'clock in the afternoon. 
The Today Form of the MAACL was administered at the
beginning and at the end of the session. The examinees were
instructed to check those adjectives which describe “how you feel 
now—today.” 

Results. Scores for the three affects before and after the exam- 
ination session are presented in Table 1. 


TABLE 1 

Means, Standard Deviations, and Tests of Significance of Difference between 
Pre- and Post-Examination Scores on MAACL Scales (N = 306) 

                Pre-Session       Post-Session         Mean
MAACL             Scores             Scores         Difference
Scale           Mean     SD       Mean     SD     (Post-Pre-test)      t

Anxiety         7.57    4.10      9.38    2.79         1.81          7.83*
Depression     11.89    5.75     18.61    6.15         6.72         18.21*
Hostility       7.15    3.15     12.96    4.96         5.81         20.39*

*Significant beyond the .001 level.


Examination of Table 1 will show that all three affects increased 
significantly from beginning to end of the testing session. Increases 
were most marked for feelings of depression and hostility. 
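
The t values in Table 1 can be approximated from the summary statistics, on the assumption that a t test for correlated (pre-post) means was used; the report does not state the formula. The sketch below takes the pretest-posttest correlations for each scale from Table 3 (.26, .39, and .30). Its results fall near the reported values but do not reproduce them exactly, which may reflect rounding of the published figures or a somewhat different procedure.

```python
import math


def correlated_t(mean_diff, sd_pre, sd_post, r_pre_post, n):
    """t for the difference between two correlated means,
    computed from summary statistics."""
    var_diff = sd_pre ** 2 + sd_post ** 2 - 2.0 * r_pre_post * sd_pre * sd_post
    standard_error = math.sqrt(var_diff / n)
    return mean_diff / standard_error


N = 306
t_anxiety = correlated_t(1.81, 4.10, 2.79, 0.26, N)
t_depression = correlated_t(6.72, 5.75, 6.15, 0.39, N)
t_hostility = correlated_t(5.81, 3.15, 4.96, 0.30, N)
print(round(t_anxiety, 2), round(t_depression, 2), round(t_hostility, 2))
```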

Comparisons between pre- and post-session measurements were 
made between males and females and with the college norm
samples presented in the manual (Zuckerman and Lubin, 1965). These 
data are presented in Table 2. 


Examination of Table 2 will show that very significant
differences between pre- and post-session means were demonstrated by 
both sexes (beyond the .001 significance level). There were
approximate standard score increases of 5 for Anxiety, 10 for
Depression, and 15 for Hostility. 

When the pre-session means were compared against the college 
norms, the present sample was seen to be more depressed (signifi- 
cant at the .05 level for both males and females), and the males 
were more anxious (significant at the .05 level). Neither the male 
nor female sample was significantly more hostile than the norm 
group at the beginning of the session. Additional comparisons of 
the experimental groups may be made with means of hospitalized 
samples furnished by the MAACL manual. 

Intercorrelation of pretest, posttest, and difference scores be- 
tween posttests and pretests are cited in Table 3. 


TABLE 3 

Intercorrelations of Pretest, Posttest, and Difference Scores between Posttests 
and Pretests for the Total Sample (N = 306)

Test Measures                     (1)  (2)  (3)  (4)  (5)  (6)  (7)  (8)  (9)

(1) Pretest Anxiety                --   70   54   26   28   19  -78  -34  -17
(2) Pretest Depression             70   --   71   28   39   24  -48  -49  -21
(3) Pretest Hostility              54   71   --   20   33   30  -38  -29  -33
(4) Posttest Anxiety               26   28   20   --   76   70   39   45   57
(5) Posttest Depression            28   39   33   76   --   77   21   58   55
(6) Posttest Hostility             19   24   30   70   77   --   27   50   79
(7) Anxiety Difference Score      -78  -48  -38   39   21   27   --   62   52
(8) Depression Difference Score   -34  -49  -29   45   58   50   62   --   88
(9) Hostility Difference Score    -17  -21  -33   57   55   79   52   88   --

Note. Decimal points omitted.


Discussion. The present results would appear to reaffirm the 
sensitivity to fluctuation in mood of Multiple Affect Adjective 
Check List measures of Anxiety, Depression, and Hostility and to 
demonstrate the effectiveness with which the changes in mood 
among normal young adults can be measured. 
There was the strong suggestion that substantial changes took 
place in affect level of students who have been exposed to a
prolonged examination experience (treatment variable). However, as 
pointed out by Campbell and Stanley (1963), the use of a one-group
pretest-posttest design (Design 2 by their classification) 
does not control either for five classes of extraneous variables 
which influence internal validity (history, maturation, testing,
instrumentation, and interactions of several of these characteristics)
or for two classes of factors that jeopardize external validity
(interaction effect of pretesting with the treatment and of selection 
biases with treatment). In light of a consideration of the conditions 
surrounding the examination session, the most serious sources of 
possible contamination would appear on logical grounds to be
testing (influence of pretest upon posttest exercise), maturation effects 
(such as fatigue, boredom, and hunger), and the reaction or
interaction effect of the pretest on the respondent's sensitivity to the 
aptitude-achievement examination, a sensitivity that might be
absent from an unpretested population of examinees from which 
respondents in the experimental group were selected. Future
efforts involving use of Solomon's Four-Group Design in which one 
of the two randomly-chosen experimental groups and one of the 
two randomly-selected control groups receive a pretest and 
the other two groups do not would offer a basis for minimizing 
confounding effects of extraneous variables and of furnishing a 
relatively valid conclusion regarding the impact of examination 
experience upon measurable affective traits. Although not
administratively feasible at the time this study was instituted, such a 
design might be advantageously employed at some future time. 
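
Random assignment under a Solomon Four-Group Design could be sketched as below. The group labels, the fixed seed, and the use of consecutive integers as examinee identifiers are illustrative assumptions; the sketch covers only the assignment step, not the analysis.

```python
import random


def solomon_four_group(subject_ids, seed=0):
    """Randomly assign subjects to the four cells of a Solomon design:
    experimental or control, each with or without a pretest."""
    rng = random.Random(seed)       # fixed seed so the split is repeatable
    shuffled = list(subject_ids)
    rng.shuffle(shuffled)
    groups = {
        "experimental_pretested": [],
        "experimental_unpretested": [],
        "control_pretested": [],
        "control_unpretested": [],
    }
    names = list(groups)
    for index, subject in enumerate(shuffled):
        groups[names[index % 4]].append(subject)   # deal out in rotation
    return groups


# Hypothetical identifiers for a sample the size of the present one.
groups = solomon_four_group(range(306))
for name, members in groups.items():
    print(name, len(members))
```

Comparing the pretested with the unpretested groups within each treatment condition is what allows the effect of pretesting, and its interaction with the examination experience, to be estimated.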

An interesting observation from the present study is that levels 
of the affects measured (see Table 1) approached or exceeded the 
scores presented in the MAACL manual for various hospitalized 
samples (see Table 2). This difference was especially true for the 
Hostility scale where the present post-session mean of 12.96 for 
the total sample compared with hospitalized sample means ranging 
from 6.7 to 10.8. The post-session Anxiety mean of 9.38 for the 
total group fell roughly halfway between the hospitalized sample 
means which ranged from 7.1 to 12.3, and the corresponding
Depression mean of 18.61 placed near the top of the range of
hospitalized sample means which varied from 13.8 to 20.8. In noting 
the interaction of test-retest and split-half reliability estimates with 
hospitalized as contrasted with non-hospitalized samples,
Zuckerman (1960) pointed out that individuals at the extremes of these 
affects such as chronic depressives may stay “reliably” depressed, 
whereas people in the normal population fluctuate in mood. Thus, 
one may conjecture that psychiatric classification is perhaps more 
clearly conceptualized as resulting not so much from degree of 
adverse affective states as from chronicity of adverse affective 
states.

The intercorrelations of the pretest measures and the
intercorrelations of the posttest measures reveal that the measures of the 
three hypothesized constructs of anxiety, depression, and hostility 
were substantially related. It may well be that the high degree of 
association among these three test variables reflected certain re- 
sponse sets associated with the five-hour examination experience 
that would be manifested by the desire of examinees to give re- 
sponses to impress the examiner with the general unpleasantness 
of the situation of having to take a long battery of examinations. 
Alternative hypotheses may be equally plausible in accounting for 
the high intercorrelations. The negative correlations among differ- 
ence scores with pretest variables also pose alternatives for their 
interpretation. 
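
One of the alternatives is purely algebraic: a difference score (posttest minus pretest) is negatively related to the pretest unless the posttest is both highly correlated with the pretest and more variable than it. A minimal sketch, using the standard deviations of Table 1 and the pretest-posttest correlations of Table 3, shows that the negative entries of Table 3 are approximately recovered from these summary figures alone. The function name is an illustrative choice.

```python
import math


def corr_difference_with_pretest(r_pre_post, sd_pre, sd_post):
    """Correlation between (posttest - pretest) and pretest,
    derived from the two standard deviations and their correlation."""
    sd_diff = math.sqrt(
        sd_pre ** 2 + sd_post ** 2 - 2.0 * r_pre_post * sd_pre * sd_post
    )
    return (r_pre_post * sd_post - sd_pre) / sd_diff


r_anxiety = corr_difference_with_pretest(0.26, 4.10, 2.79)
r_depression = corr_difference_with_pretest(0.39, 5.75, 6.15)
r_hostility = corr_difference_with_pretest(0.30, 3.15, 4.96)
print(round(r_anxiety, 2), round(r_depression, 2), round(r_hostility, 2))
```

The values obtained lie close to the tabled correlations of -.78, -.49, and -.33, which suggests that these coefficients carry little information beyond what is already contained in the pretest and posttest statistics.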

Early replication of the study is planned with an improved
design and with diversification in measures so that the tentative
inferences suggested may be carefully examined. 


PREDICTIVE VALIDITY AND DIFFERENTIAL 
ACHIEVEMENT ON THREE MLA—COOPERATIVE 
FOREIGN LANGUAGE TESTS 


HENRY F. DIZNEY 
University of Oregon 
AND 
LAUREN GROMEN 
Kent State University 


Confronted with rapidly expanding interest in foreign language
instruction in American education, the Modern Language
Association of America (MLA) and Educational Testing Service
(ETS) have developed batteries of tests held to be of improved
relevance in five foreign languages. These tests are described in 
detail in the Handbook for MLA—Cooperative Foreign Language 
Tests (ETS, 1965). It is likely that the prestige of both MLA and 
ETS coupled with demands for foreign language placement in
colleges will lead to wide usage of these tests. 

Among the suggested uses for the test is the foreign language 
course placement of individual students (Handbook, p. 6). Pub- 
lished validity evidence is lacking regarding this purpose. The 
present study was undertaken in order to provide: (a) validity 
evidence in terms of course success, and (b) information about 
the tests’ ability to differentiate achievement between instructional 
levels.

Subjects. The Ss of this study were Kent State University
students enrolled in three foreign languages during the 1965 winter
quarter. The languages represented are French, German, and
Spanish. At the time of testing, the Ss were beginning either the
second, third, or fourth quarter of language instruction and had
successfully completed the prerequisite course or courses. It was
assumed that in most cases the Ss' enrollment followed a regular
instructional sequence.

Tests. As stated in the MLA Handbook (ETS, 1965), the tests
provide separate measures of listening, speaking, reading, and
writing in each language. Of these skills, listening and reading are
scored with objective keys, whereas in the case of writing and
speaking a degree of scorer judgment is exercised. The speaking
tests were not used in this study because it was felt that the costs
and time involved were prohibitive. Thus, data were obtained for
three measures of skills (listening, reading, and writing) in each of
three languages: French, German, and Spanish.

Two levels of test, L and M, are provided, each having two 
parallel forms designated A and B. The L level test is designed for 
use in the first year of language learning in college. The L level, 
form A tests were used in this study. Reliabilities are reported in 
the manual for each skill and language. The coefficients for listen- 
ing are: .89 French, .89 German, and .90 Spanish. For reading they 
are: .93 French, .90 German, .89 Spanish. For writing they are: 
.95 French, .94 German, and .95 Spanish (Handbook, pp. 21-24).

Results. Table 1 contains the intercorrelations for each language;
a multiple-R of the three skills to course grade is also given. The
classes from which these data were obtained were the second
quarter of instruction in each language. It was felt that validity
coefficients at this level of instruction involving the test scores and
course grades would be generally most applicable, since most
placement decisions occur at the second instructional level. The
MLA tests were administered during the first week of instruction.
Grades represented the instructors' final evaluation. Grade
distributions were combined by language from 6 French, 8 German,
and 8 Spanish sections taught by 3, 5, and 4 instructors,
respectively.


TABLE 1
Intercorrelations and Multiple-Rs for Listening, Reading, Writing, and Total
MLA Test Scores and Grades in a Second Quarter of French, German,
and Spanish

                    Reading    Writing    Total    Course Grade
French (N illegible)
  Listening         [illeg.]   [illeg.]    .82        .54
  Reading                      [illeg.]    .86        .50
  Writing                                  .96        .53
  Total                                               .58
  Multiple R                                          .59
German (N = 111)
  Listening         [illeg.]     .60       .87      [illeg.]
  Reading                        .58       .86        .33
  Writing                                  .88      [illeg.]
  Total                                             [illeg.]
  Multiple R                                        [illeg.]
Spanish (N = 135)
  Listening           .66        .66       .81        .48
  Reading                        .68       .82      [illeg.]
  Writing                                  .96        .64
  Total                                             [illeg.]
  Multiple R                                        [illeg.]


TABLE 2
Means and Standard Deviations in Three MLA Skills from Three Instructional Levels for
French, German, and Spanish

[table body not recoverable]


It may be seen from Table 1 that the multiple-R’s and “total” 
MLA zero-order coefficients are practically identical. Generally, 
the correlations between specific skills and course grades approach 
those for their total and course grades. This condition is also evi- 
dent from the relatively high inter-skill correlations. 

Table 2 contains descriptive data from the MLA tests by skill, 
language, and quarter of instruction. In a sense these data supply
construct validity evidence, in that it may logically be assumed 
that successively advanced quarters of instruction should result in 
successively higher scores for such a test. The quarters desig- 
nated 1, 2, and 3 represent the last quarter completed prior to 
testing. Analysis of variance procedures were used to test the 
significance of the differences between quarters for each skill and 
language. All nine of the obtained F-ratios were significant at or 
beyond the .001 level. 

Summary. In summary, it should be noted that there was a
substantial degree of overlap in score distributions between quarters.
This is true for every skill and every language. Although this is a
common condition for placement tests, it invites caution in applying
set test criteria for individuals in course placement. Perhaps
an improved design for a study of MLA tests, providing better
information regarding placement efficiency, will be forthcoming
from ETS.
THE PREDICTION OF THE LEARNING OF
CHEMISTRY AMONG ELEVENTH GRADE GIRLS 
THROUGH THE USE OF THE STEPWISE 
AND DOOLITTLE TECHNIQUES 


AGNES Y. BAE 


Rosewood State Hospital 
Owings Mills,


A close examination of the literature indicates that few prediction
studies have been done with respect to high school chemistry.
Hendricks and Johnson (1929) investigated the validity of the
Iowa Chemistry Aptitude Examination. The results of their study
showed that it was not a valid indicator of chemistry aptitude.
Several other studies explored validities of various predictors which
included standardized intelligence tests for the criterion of
chemistry achievement (Cook, 1931; Edward and Wilson, 1959; Porter
and Anderson, 1959; Winegardner, 1939), but these studies were
limited in the kinds of variables examined and the degree of
accuracy of prediction.

In this study, it was hypothesized that combinations of existing
measuring instruments could be identified which would maximize
the predictive validity. The primary purpose of the current
research was to determine the contributions of various factors to
chemistry achievement among eleventh grade girls.

Procedures. The subjects for this study were 117 eleventh grade
girls in two high schools in California. Fifteen variables comprising
a battery of six different tests were used as predictive bases. They
were: (1) Iowa Chemistry Aptitude Examination (ICAE): I,
Mathematical Concept; (2) Iowa Chemistry Aptitude Examination:
II, Chemical Formula; (3) Iowa Chemistry Aptitude Examination:
III, Reading in Chemistry; (4) Iowa Chemistry Aptitude
Examination: IV, Information in Science; (5) Differential Apti- 
tude Tests (DAT): Verbal Reasoning; (6) Differential Aptitude 
Tests: Numerical Ability; (7) Differential Aptitude Tests: Ab- 
stract Reasoning; (8) Differential Aptitude Tests: Sentences; (9) 
Test on Understanding Science; (10) Science Background: Things 
Done; (11) Science Background: Vocabulary; (12) Sequential 
Tests of Educational Progress: Science; (13) Science Aptitude 
Examination: Part A, Science Background; (14) Science Aptitude 
Examination: Part B, Reading in Science; and (15) Science Apti- 
tude Examination: Part C, Scientific and Mathematical Informa- 
tion. The predictor tests were administered to the subjects at the 
beginning of their first chemistry course, in September, 1963. 

The criterion tests consisted of two forms of the American 
Chemical Society-NSTA Examination in High School Chemistry: 
Form 1961 and Form 1963. Form 1961 was administered at the 
end of the first semester in January, 1964, and Form 1963 was 
administered at the end of the second semester in May, 1964.

Results. The results were analyzed by both the Stepwise and the
Doolittle methods of multiple regression analysis.

Table 1 presents the Stepwise analysis for the selection of the 
most efficient predictor variables for both January and May crite- 
ria. Inspection of this table reveals that the most efficient variables 
for the prediction of the January criterion consisted of four tests. 
They were: ICAE: Information in Science (var. no. 4); DAT: 
Verbal Reasoning (var. no. 5); DAT: Numerical Ability (var. no. 
6); and Science Aptitude Examination: Reading in Science (var. 
no. 14). 

In order to determine whether the multiple correlation based on
the above four variables was as efficient as the one based on the
initial fifteen variables, the difference between
$R_{16.4,5,6,14} = .7477$ (see footnote 2) and
$R_{16.1,2,3,4,5,6,7,8,9,10,11,12,13,14,15} = .7873$ was tested for
significance. The F test showed 1.47, which was not significant at the


2 Variable no. 16 is the January criterion.
TABLE 1
Regression Coefficients, Constants, Errors of Estimate, Multiple
Correlation Coefficients, Coefficients of Multiple Determination
and F-Ratios as Obtained in the Stepwise Analysis
N = 117

        Entering   Regression                Error of
Step    Variable   Coefficient   Constant    Estimate     R      R²     F Ratio

JANUARY CRITERION
1          5          .93          -8.81       9.21      .61    .37     68.03
2          4          .66          -8.06       8.54      .68    .46     19.69
           5          .80
3          4          .58         -14.86       8.05      .73    .53     15.47
           5          .61
           6          .50
4          4          .55         -20.70       7.82      .75    .56      7.73
           5          .49
           6          .47
          14          .54
5          4        [illeg.]       22.20       7.76      .75    .57      2.59*
                                 [sign illegible]
           5        [illeg.]
           6          .43
           8          .13
          14          .51

MAY CRITERION
1          1         1.71          -2.68       8.88      .58    .34     58.98
2          1         1.26         -11.25       8.49      .63    .40     11.87
           5          .42
3          1         1.00         -10.19       8.22      .67    .44      8.69
           4          .45
           5          .40
4          1          .74         -14.11       8.05      .69    .47      5.76
           4          .45
           5          .33
           6          .33
5          1          .77         -13.29       8.01      .69    .48      2.19*
           4          .45
           5          .34
           6          .38
          15          .31

*Not significant at the .05 level.


.05 level. This outcome suggested a nearly maximum predictive
validity of the combination of these four variables.

The selected variables with a maximum predictive efficiency for
the May criterion consisted of the following four variables:
ICAE: Mathematical Concept (var. no. 1); ICAE: Information in
Science (var. no. 4); DAT: Verbal Reasoning (var. no. 5); and
DAT: Numerical Ability (var. no. 6).
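
The article does not print the formula behind its F test, so the following is only a sketch under an assumption: that the usual F ratio for the difference between two nested multiple correlations was used. The function name and argument names are illustrative and do not come from the source. Under that assumption, the January values quoted in the text (R = .7873 for fifteen predictors, R = .7477 for four, N = 117) give the reported figure of 1.47.

```python
def f_for_r_difference(r_full, r_reduced, n, k_full, k_reduced):
    """F ratio for the loss in R^2 when predictors are dropped.

    Assumed formula (not stated in the article):
        F = [(R2_full - R2_reduced) / (k_full - k_reduced)]
            / [(1 - R2_full) / (n - k_full - 1)]
    """
    r2_full = r_full ** 2
    r2_reduced = r_reduced ** 2
    df_num = k_full - k_reduced      # number of predictors dropped
    df_den = n - k_full - 1          # residual df of the full equation
    f_ratio = ((r2_full - r2_reduced) / df_num) / ((1.0 - r2_full) / df_den)
    return f_ratio, df_num, df_den


# January criterion: fifteen predictors versus the four selected ones.
f_january, df1, df2 = f_for_r_difference(
    r_full=0.7873, r_reduced=0.7477, n=117, k_full=15, k_reduced=4
)
print(round(f_january, 2), df1, df2)
```

With 11 and 101 degrees of freedom, a ratio of this size falls well short of the .05 critical value, which is consistent with the conclusion drawn in the text.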
TABLE 2
Beta Weights and Multiple Correlation Coefficients
for Certain Combinations of Predictors
According to the Doolittle Method
N = 117

            Correlation     Beta Weights for Certain Combinations of Predictors
Variable       with         Fifteen     Six      Five     Four     Three
Code         Criterion

JANUARY CRITERION
 1             .43           -.14
 2           [illeg.]         .04
 3             .41            .12
 4             .45            .23       .22      .24      .26      .28
 5             .61            .22       .20      .26      .32      .40
 6             .55            .26       .24      .25      .27      .29
 7             .49            .06
 8             .49            .10       .13      .14
 9             .48           -.18
10             .16           -.02
11             .50            .11
12             .53            .15       .11
13             .17            .08
14             .51            .12       .18      .19      .20
15             .24          [illeg.]
R²                            .62       .58      .57      .56      .53
R                             .79       .76      .75      .75      .73
F (for successive pairs of R):  1.30, 2.03, 2.58, 7.72*

MAY CRITERION
 1             .58            .21       .21      .23      .25      .34
 2             .46            .12       .11
 3             .43            .16       .11    [illeg.]
 4             .45            .19       .21      .20      .23      .23
 5             .52            .12       .17      .21      .23      .28
 6             .51            .19       .18      .19      .21
 7             .42            .00
 8             .42            .01
 9           [illeg.]        -.05
10             .07            .04
11             .41            .06
12             .41            .01
13             .09            .02
14             .35            .09
15             .13           -.17
R²                            .52       .49      .48      .47      .44
R                             .72       .70      .69      .69      .67
F (for successive pairs of R):  .76, 1.48, 1.90, 5.70*

*Significant beyond the .05 level.
The $R_{17.1,4,5,6} = .6865$ (see footnote 3) was tested against the
$R_{17.1,2,3,4,5,6,7,8,9,10,11,12,13,14,15} = .7208$ to examine whether
the former was significantly lower than the latter. An F ratio of 1.92
was obtained. This result indicated that the combination of these
selected four predictors was probably nearly as efficient as that of
the original fifteen predictors.

Table 2 presents data obtained from the Doolittle analysis of the
January and May criteria. The data for the collection of fifteen
predictors served as the bases for the successive selection of sub- 
sets of six, five, four, and three predictors. The variables which 
contributed most to the variance of the criterion were identified as 
those producing the highest cross products between their simple 
correlation coefficients with the criterion and their Beta weights. 
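
The selection rule just described rests on a standard identity: with standardized variables, the products of each predictor's Beta weight and its simple correlation with the criterion sum to R². The sketch below illustrates this with the January three-predictor combination (variables 4, 5, and 6) as read from Table 2. The function name is hypothetical, and the rounding in the published weights means the total is only approximate.

```python
def variance_contributions(correlations, betas):
    """Return beta * r for each predictor; the products sum to R^2."""
    return {var: correlations[var] * betas[var] for var in betas}


# January criterion, three-predictor combination (values read from Table 2).
r_with_criterion = {4: 0.45, 5: 0.61, 6: 0.55}
beta_weights = {4: 0.28, 5: 0.40, 6: 0.29}

contributions = variance_contributions(r_with_criterion, beta_weights)
r_squared = sum(contributions.values())

# Predictors ordered from largest to smallest share of criterion variance.
ranked = sorted(contributions, key=contributions.get, reverse=True)

print(contributions)
print(round(r_squared, 2), ranked)
```

The total of about .53 agrees with the R² reported at step 3 of the Stepwise analysis in Table 1, where the same three variables are in the equation.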

For the January criterion, ICAE: Information in Science; DAT:
Verbal Reasoning; DAT: Numerical Ability; and Science Aptitude
Examination: Reading in Science were selected as evidencing the
simultaneous maximizing of accountable variance and minimizing
of number of predictors. These tests were essentially the same tests
as those identified by the Stepwise method. Using the coefficients
and the constant in Table 1, the regression equation for
the prediction of the January criterion becomes
$Y_1 = .55X_4 + .49X_5 + .47X_6 + .54X_{14} - 20.70$.

The most efficient predictors for the May criterion were: ICAE:
Mathematical Concept; ICAE: Information in Science; DAT:
Verbal Reasoning; and DAT: Numerical Ability. These were the
same as those four tests which were selected by the Stepwise
method. Using the coefficients and the constant in Table 1, the
multiple regression equation for predicting the May criterion scores
becomes $Y_2 = .74X_1 + .45X_4 + .33X_5 + .33X_6 - 14.11$.
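
As a minimal sketch of how the two equations would be applied, the code below stores the step 4 coefficients and constants from Table 1 and evaluates them for one student. The student's raw scores are invented purely for illustration, and the dictionary layout and function name are assumptions rather than anything described in the article.

```python
# Coefficients keyed by predictor variable number (step 4 of Table 1).
JANUARY = {"weights": {4: 0.55, 5: 0.49, 6: 0.47, 14: 0.54}, "constant": -20.70}
MAY = {"weights": {1: 0.74, 4: 0.45, 5: 0.33, 6: 0.33}, "constant": -14.11}


def predict(equation, scores):
    """Weighted sum of the predictor raw scores plus the constant."""
    weighted = sum(w * scores[var] for var, w in equation["weights"].items())
    return weighted + equation["constant"]


# Hypothetical raw scores on variables 1, 4, 5, 6, and 14 (not from the study).
student = {1: 12, 4: 20, 5: 30, 6: 25, 14: 15}

january_score = predict(JANUARY, student)
may_score = predict(MAY, student)
print(round(january_score, 2), round(may_score, 2))
```

For these made-up scores the predicted criterion values are 24.85 (January) and 21.92 (May). The errors of estimate in Table 1 (7.82 and 8.05) indicate how wide the band around any such prediction would be.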

Conclusion. The findings indicate that the most efficient
predictors for achievement in high school chemistry fall into three
categories. The categories are: general intelligence, background 
experience in science and mathematics, and reading comprehension 
as measured by selected subtests of the Iowa Chemistry Aptitude 
Examination, Differential Aptitude Test, and Science Aptitude 
Examination. 


3 Variable no. 17 is the May criterion.
VALIDITIES OF THE D-48 TEST FOR
USE WITH COLLEGE STUDENTS


MAYNARD E. BOYD¹ AND GEORGE WARD II
Marshall University


The D-48 test is an American version of the Dominoes tests
which have had some popularity in England and France for several
years (Black, 1961). The research edition of the test, which is 
presented as a nonverbal test of general mental ability, has been 
the subject of recent studies (Gough and Domino, 1963; Domino, 
1964; Welsh, 1966). 

Purpose and method. The present study was designed to evalu- 
ate the D-48 for use with college students by comparing its re- 
liability and validity characteristics with those of the Raven 
Progressive Matrices Test, the Otis Quick Scoring Mental Ability 
Test (Form Gamma), and the Composite Score provided by the 
American College Test (ACT). Scores on these four predictors 
were obtained for 231 university undergraduates, and subsequent
end-of-semester grade point averages served as the criterion
investigated.

Results and discussion. Mean total scores on the D-48 were not
significantly different for males compared with females. The
Kuder-Richardson Formula 20 estimate of reliability was .85 for
the 44-item test. For comparison with the longer (80-item) Otis, a
Spearman-Brown estimate of a lengthened D-48 reliability was
computed. The obtained figure of .91 was similar to the .88
corrected split-half estimate reported for the Otis form when used
with college students (Otis, 1954).
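
The lengthening estimate can be reproduced with the general Spearman-Brown prophecy formula, taking the length factor as 80/44. This is a sketch only: the article gives the result but not the computation, so the function name and the choice of 80/44 as the factor are assumptions.

```python
def spearman_brown(reliability, length_factor):
    """Predicted reliability of a test lengthened by `length_factor`.

    General Spearman-Brown formula: k * r / (1 + (k - 1) * r).
    """
    return length_factor * reliability / (1.0 + (length_factor - 1.0) * reliability)


# 44-item D-48 (KR-20 = .85) projected to the 80-item length of the Otis.
lengthened = spearman_brown(0.85, 80 / 44)
print(round(lengthened, 2))
```

The projected value rounds to .91, matching the figure reported above.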

Table 1 summarizes correlation coefficients pertinent to the
congruent and predictive validities of the D-48.


1 Deceased.
TABLE 1
Correlations among D-48, Otis, Raven and ACT Composite
Test Scores and Grade Point Average
(N = 281)*

            Otis     D-48     Raven    GPA
ACT         .70      .42      .22      .48
Otis                 .57      .33      .37
D-48                          .39      .20
Raven                                  .16


*All coefficients significant at the .01 level

The correlations are modest as academic performance forecast- 
ing typically fares, but the ranking of GPA coefficients reflects an 
order of complexity in the predictors if it is assumed that grade 
average is factorially complex. In addition, the coefficient of valid- 
ity for the D-48 would place it somewhere between the Otis and 
Raven in agreement with this assumption. A subsequent factor 
analysis of the D-48 confirmed this point (Boyd and Ward, 1966). 
It would appear that if either the Raven or Otis was employed in 
evaluating academic potential, the D-48 may serve as an acceptable 
adjunct instrument. 

Summary. The D-48, a nonverbal general mental ability test, 
was evaluated for use with college students in comparison with 
the ACT, Otis, and Raven tests. No differences in D-48 scores 
were related to sex, and the reliability estimate compared favor- 
ably with that of the Otis. For efficiency in predicting semester 
grade point averages, the D-48 ranked below the ACT and Otis 
but above the Raven. 
PREDICTION OF FRESHMAN AND SOPHOMORE
GRADE-POINT AVERAGES OF WOMEN PHYSICAL 
EDUCATION MAJOR STUDENTS 


LARETHA LEYMAN 
State University College, Cortland 


The purpose of this study was to investigate the validity of
(1) the pre-admissions measures as predictors of freshman grade- 
point average (FGPA) and, (2) the pre-admissions measures and 
FGPA as predictors of sophomore grade-point average (SGPA) 
of women physical education major students enrolled at State Uni- 
versity College, Cortland, New York. Because of the nature of the 
major field a motor ability test was included as one of the pre- 
admissions measures. The validity of this measure as a predictor of 
freshman physical education activity course grade-point average 
was also considered. 

Variables. The pre-admissions variables were the State Univer- 
sity Admissions Examination (SUAE TOT), high school average 
(HSA), and the Scott Motor Ability Test (SMA). The State Uni- 
versity Admissions Examination total score and the aptitude and 
achievement subtest scores were each included as separate vari- 
ables. The high school average was the average of five final course 
grades and five regents grades. Final course grades used were those 
for second- and third-year English, intermediate algebra, biology, 
and American history. Regents grades were those for English, 
chemistry, physics, world history, and geometry. Freshman grade- 
point average was added as a variable in the prediction of sopho- 
more grade-point average. 

Criterion variables. The criterion variables were overall grade- 
point averages for the freshman and sophomore years and physical 


education activity course grade-point average for the freshman 
year. 
Subjects. The subjects, who initially numbered 201, were women
students who enrolled in the falls of 1963 and 1964 and followed 
the physical education curriculum for one year. One hundred and 
sixty-two of the original group completed the second year of 
work. 

Results. Table 1 shows the results of the zero-order and multiple 
regression analyses between the pre-admissions variables and the 
criterion measures of freshman physical education activity course 
and overall grade-point average. 


TABLE 1

Intercorrelations among Five Predictor and Two Criterion Variables and a
Coefficient of Multiple Correlation (R) for 201 Freshman Physical
Education Majors

Variables                                    1        2       3      4      5      6
1. State University Admissions
   Examination, Aptitude                     --
2. State University Admissions
   Examination, Achievement               [illeg.]    --
3. State University Admissions
   Examination, Total                       .95*     .89*     --
4. High School Average                      .42*     .51*    .50*    --
5. Scott Motor Ability Test                -.04      .08    -.02    .05     --
6. Activity Course Grade-point Average      .22*     .27*    .27*   .26*   .30*    --
7. Freshman Grade-point Average             .33*     .44*    .41*   .56*   .10    .51*

R7.12345 = .60*
Standard Error of Estimate = ± .38

*Significant beyond the .01 level


The zero-order correlations among all of the intellectual vari- 
ables and FGPA were significant. The multiple regression analysis 
yielded an R of .60, only slightly larger than the zero-order
coefficient of .56 produced by the correlation of HSA with FGPA.

The correlation between the motor ability variable and activity 
grade-point average was .30, a low but significant correlation. 

Correlational analyses on the sophomore students included both
the pre-admissions measures and FGPA as predictors; sophomore
grade-point average was used as the criterion. Table 2 reports the
zero-order and multiple correlation coefficients.

Again all zero-order correlations of intellectual pre-admissions
measures with the criterion SGPA were significant; however, the
highest correlation was between FGPA and the criterion. This is
in line with findings in similar studies. A measure of achievement
TABLE 2

Intercorrelations among Six Predictors and a Criterion Variable and a
Coefficient of Multiple Correlation (R) for 162 Sophomore
Physical Education Majors

Variables          1          2          3        4       5       6
1. SUAE APT        --
2. SUAE ACH     [illeg.]      --
3. SUAE TOT       .95*       .89*        --
4. HSA          [illeg.]   [illeg.]     .48*      --
5. SMA           -.11        .01       -.10      .00      --
6. FGPA           .28*       .36*       .34*     .56*    .07      --
7. SGPA           .27*       .31*       .32*     .43*    .05     .71

R7.123456 = .71
Standard Error of Estimate = ± .33

*Significant beyond the .01 level



in college is a better predictor of future college success than are 
pre-admissions measures. 

Summary. 1. High school average and the entrance examination
measures were significant predictors of FGPA. High school aver- 
age as defined in this study was the single best predictor. 

2. The motor ability measure was a significant predictor of 
freshman activity course grade-point average; however, the rela- 
tionship between the predictor and the criterion was so low that it 
was of little value. 

3. Freshman grade-point average, high school average, and the 
entrance examination measures were significant predictors of 
SGPA. Freshman grade-point average was the best single pre- 
dictor. 

4. The use of multiple measures for prediction was not war- 
ranted. Increase in validity was negligible when compared to the 
use of HSA for prediction of freshman grade-point average or 
that of FGPA for prediction of sophomore grade-point average. 
PREDICTIVE VALIDITIES OF THE ACT,
SAT AND HIGH SCHOOL GRADES FOR FIRST 
SEMESTER GPA AND FRESHMAN COURSES 


WILLIAM R. PASSONS 
Temple University 


Problem. The purpose of this study was to determine predictive 
validities of two standardized tests and high school grades as
predictors of first semester GPA and of grades in ten general
education courses.

Predictor variables. The nine predictors included the Verbal 
(V), Mathematical (M), and Total (T) scores of the College 
Entrance Examination Board Scholastic Aptitude Test (SAT); the 
English (E), Mathematics (M), Social Science (SS), Natural Sci- 
ence (NS), and Composite (C) scores of the American College 
Test (ACT); and the average of high school recommending grades 
(HSRG), which were A’s and B’s in academic courses. Both tests 
were administered as part of pre-registration at Fresno State Col- 
lege (FSC). 

Sample. Subjects included 882 FSC freshmen, 376 men and 506 
women, who completed the 1963-64 academic year at FSC. Subsets 
of this sample were used in computing validities in the selected 
courses. 

Criteria. The eleven criteria were: (1) First semester GPA, (2) 
Biology 2A—Life Science, (3) English 1A—Reading and Com- 
position, (4) English 1B—Introduction to Literature, (5) History 
1—Western Civilization, (6) History 10—American, (7) Geology 
1—Physical, (8) Mathematics 3—Calculus, (9) Physical Science 
10—Introduction, (10) Psychology 7—Introduction, (11) Sociol- 
ogy 1A—Introduction. 

Validity data. Table 1 contains a summary of the correlational 
data and the N used in each analysis. 
TABLE 1

Zero-Order Correlations* of ACT, SAT, and HSRG with GPA and
Grades in Ten Courses

                              SAT SCORES           ACT SCORES
CRITERIA (N)        HSRG     V     M     T      E     M     SS    NS    C
Sem. GPA (882)       41      39    26    19     33    21    33    28    36
Biol. 2A (140)       37      50    48    56     43    36    38    45    51
Engl. 1A (463)       38      35    20    29     41    12    25    15    28
Engl. 1B (198)       20      30    08    22     28   -06    27    08    16
Hist. 1 (223)        47      56    27    48     45    22    51    31    48
Hist. 10 (252)       34      46    18    36     36    19    41    33    41
Geol. 1 (253)        31      40    40    48     37    25    39    26    42
Math. 3 (110)        20      34    45    47     34    38    33    35    47
P. Sci. 10 (103)     41      48    57    58     41    48    37    49    55
Psy. 7 (676)         34      41    26    18     31    23    40    29    39
Soc. 1A (182)        21      46    17    37     [E, M, SS, NS illegible]  38

*Decimal points have been omitted.
Numbers in parentheses denote the N used to compute correlations for each criterion.


Findings. HSRG yielded the highest predictive validity for first 
semester GPA, but the test scores had slightly higher validities for 
predicting grades in courses. However, in neither comparison were 
the differences between the highest and the next highest validities 
of any practical significance. 

From Table 1 the validity of each predictor for each of the 
criteria may be ascertained. Ranking the nine predictors in terms 
of validities for each criterion and summing these ranks across all 
criteria reveals SAT-V as the most productive single predictor for 
these criteria, followed closely by ACT-C. It does not appear that 


either the SAT or the ACT clearly surpasses the other in predic- 
tive power. 
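
The rank-sum comparison described above can be sketched as follows. This is an illustration under stated assumptions, not the author's exact procedure: tied validities are given their mean rank (the article does not say how ties were handled), and the eleventh criterion, Sociology 1A, is left out because several of its entries are illegible in this copy of Table 1. The names `PREDICTORS`, `VALIDITIES`, and `average_ranks` are introduced here for the example.

```python
PREDICTORS = ["HSRG", "SAT-V", "SAT-M", "SAT-T",
              "ACT-E", "ACT-M", "ACT-SS", "ACT-NS", "ACT-C"]

# Validities (decimal points omitted) for the ten fully legible criteria.
VALIDITIES = {
    "Sem. GPA":   [41, 39, 26, 19, 33, 21, 33, 28, 36],
    "Biol. 2A":   [37, 50, 48, 56, 43, 36, 38, 45, 51],
    "Engl. 1A":   [38, 35, 20, 29, 41, 12, 25, 15, 28],
    "Engl. 1B":   [20, 30, 8, 22, 28, -6, 27, 8, 16],
    "Hist. 1":    [47, 56, 27, 48, 45, 22, 51, 31, 48],
    "Hist. 10":   [34, 46, 18, 36, 36, 19, 41, 33, 41],
    "Geol. 1":    [31, 40, 40, 48, 37, 25, 39, 26, 42],
    "Math. 3":    [20, 34, 45, 47, 34, 38, 33, 35, 47],
    "P. Sci. 10": [41, 48, 57, 58, 41, 48, 37, 49, 55],
    "Psy. 7":     [34, 41, 26, 18, 31, 23, 40, 29, 39],
}


def average_ranks(values):
    """Rank from highest (1) to lowest; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: -values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        mean_rank = (pos + end) / 2 + 1   # positions are 0-based
        for i in order[pos:end + 1]:
            ranks[i] = mean_rank
        pos = end + 1
    return ranks


rank_sums = {name: 0.0 for name in PREDICTORS}
for row in VALIDITIES.values():
    for name, rank in zip(PREDICTORS, average_ranks(row)):
        rank_sums[name] += rank

# Smallest rank sum = most consistently valid predictor.
ordering = sorted(PREDICTORS, key=rank_sums.get)
print(ordering)
print(rank_sums)
```

Even on this reduced set of ten criteria, SAT-V has the lowest rank sum and ACT-C the next lowest, in line with the ordering stated in the text.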
STUDENT ABILITY: ITS EFFECT ON
READING-PERSONALITY RELATIONSHIPS 


CAROLYN M. NEAL 
University of California 
Santa Barbara, California 


The purpose of this study was to find rudimentary patterns (for
a freshman student group entering the University of Illinois
College of Education) of the relationship between reading test scores
and Minnesota Multiphasic Personality Inventory (MMPI) scores
and between reading test scores and Kuder Preference Record
(KPR) scores for sub-groups of individuals who were:


(a) high on the School and College Ability Test (SCAT).
(b) low on the SCAT. 

(c) high in grade point average (GPA). 

(d) low in GPA. 

After the pattern of relationship was found between MMPI
scores and reading test scores for the total population, attention
was directed to finding the difference in pattern and size of
correlation coefficients for the sub-groups who were (a) high on
SCAT versus (b) low on SCAT and (c) high in GPA versus
(d) low in GPA.

Again, after the pattern of relationship was found between KPR
scores and reading test scores for the total population, the
differences were sought in pattern and size of correlation coefficients
for the sub-groups who were (a) high on SCAT versus (b) low on
the SCAT and (c) high in GPA versus (d) low in GPA.

In terms of the measures used, it was hypothesized that
personality differences exist between the able and the highly able
reader.

Procedure. A battery of tests (the KPR, the Cooperative English
Examination (CEE), and the SCAT) given to University of
Illinois freshmen entering the field of education (Fall 1961)
revealed, as might be expected, a wide range of scores reflecting
individual differences in both academic ability (measured by the
CEE) and personality (measured by use of the KPR and the
MMPI). Approximately 80 per cent of the students selected for
the field of education were from the top one-fourth of their class
on the SCAT test, but a small proportion who earned scores in the
lower ranges were accepted, inasmuch as the institution is state
supported. Those students who survived freshman attrition were
administered the MMPI at the beginning of their sophomore year
and were granted advanced standing in the College of Education
after they had been judged as falling within the normal range of
scores on each personality trait. Thus, the members of the sample
of 517 students (being "nondropouts" and psychologically
"normal") were less likely to be highly deviant with regard to either
variable (personality and reading ability) than an unselected
sample would be.

After a general trend of relationship between measures of 
personality and reading ability was found, the sample population 
was divided into two groups—those persons who earned a grade 
point average (GPA) of 3.5 and above (on a five point scale where 
A = 5, B = 4, C = 3, D = 2, and F = 1) and those persons who
had a GPA below 3.5. 

Following analysis on the basis of grade point average, the sam- 
ple was recombined and re-examined on the basis of SCAT Q 
scores, a measure of the general ability of students. The
correlation of the two major variables was found after dividing the
subjects into those falling into the top 26 per cent and the lowest 25
per cent of the tested group in order to draw sharp contrast of the
most able with the least able in the sample population. The limited
range of ability makes low correlations almost inevitable; thus the
[illegible] might be considered exploratory rather than predictive.

Tests. As indicated previously, the variables in the personality 
domain consist of measures on the MMPI, two measures of validity 
of response and one measure of test-taking attitude, and the fifteen 
vocational preference and aptitude scores on the Kuder Preference 

Record. The variables in the reading ability measure were: reading 
speed of comprehension, reading attempts score (speed of read- 
ing), reading vocabulary score, spelling score (word analysis), 
grammar and punctuation, and the SCAT verbal ability score.

Results. Linear correlations for the entire population, determining
the relationship between personality characteristics and reading
ability, were not high, but many were significant.

Total population. Linear correlation indicated that the following
characteristics were generally negatively correlated to reading
ability (see Table 1):

MMPI
[items a through f illegible; one entry ends "and Hysterical"]
g. Obsessive-compulsive
h. Schizophrenia
i. Hypomania
KPR
j. Some areas of Computational, Social Service, and Clerical


TABLE 1

Linear Correlation Coefficients between Variates in the Set of Personality
Scales and the Set of Reading Measures Using Whole Sample of
517 Subjects

[Row labels (personality scale names) are illegible in the source.]

                     Reading Measures
sp-com     SCAT      vocab     spell     gram      read-at

-10*       -09**     -08**     -11*      -13*      -09*
-09*       -09*      -01       -09*      -10*      -03
-09*       -09*      -06       -13*      -12*      -06
-09*       -14*      -08       -07       -08       +02
-02        +02       +02       -10*      -07       -05

-10        -06       -07       -19*      -18        01
-14*       -21*      -16*      -23*      -23*      -07
+09*       +04       +05       -03       -01       -03

-04        -11*      -07       -08       -08**     -02
-02        -01       +02       -09*      -11*      -02
-13        -11*      -10*      -04       -11*      -08**
+08        -01       +12*      +03       +08       +08

+11*       +17*      +12       +10*      +14*      +09*
+10*       +06       +10*      +06       +08       +01
+02        -08**     -10*      +07       +12*      +04
+07        +14       +09*      +03       +07       +09*
+19*       +27*      +25*      +14*      +11*      +17*
-04        -07       -09*      -07       -04       -08
-02        -1[?]*    -07       +05       +02       -01

Key: sp-com = Speed of comprehension. read-at = Reading attempts score.
*Significant at the .05 level.
**Significant at the .10 level.
Decimal points omitted.
Positively correlated to reading ability were the personality
characteristics:


MMPI                 KPR
a. Paranoia          d. Agreeable
b. Introversion      e. Scientific
c. Theoretical       f. Literary


and some aspects of computational


The sample, taken as a whole, did not reveal high correlation 
between reading ability and personality. However, the pattern of 
the correlation suggested the inference that neurosis may be a 
factor associated with decreased reading performance. Inspection 
of sample correlation indicated that the able reader placed rela- 
tively low on traits thought to contribute to a neurotic syndrome, 
but tended to have theoretical, literary interests, and tended to- 
ward an agreeably introverted nature, 

Division of the sample on the basis of GPA. The sample of 517
subjects, when divided on the basis of grade point average,
numbered 76 persons having a GPA of 3.5 and above (hereafter
referred to as honor students) and 271 persons having a GPA
below 3.5 (nonhonor students). Grade point averages were
not available for 170 students; these subjects were not used in this
analysis.

Honor students. The personality variables among persons having
a GPA of 3.5 or higher which were found to be negatively related
to reading were (see Table 2):


MMPI
a. L Scale
b. Depression
c. Hysteria
d. Psychopathic Deviate
e. Pathological Sexuality
f. Schizophrenia

KPR
g. Sociable
h. Musical


Linear correlation showed that among these same high achieving
students, personality variables which were positively correlated to
reading ability were:


KPR 
а. Computational b. Literary 


TABLE 2
Linear Correlation between Variates in the Set of Personality Measures and the
Set of Reading Measures: Division of Data on the Basis of
Grade Point Average, Education Majors Having
3.6 Averages and Above (N = 76)

                                     r with
Person. Variab.   sp-com    SCAT      vocab     spell     gram      read-at
MMPI
L Sca             -27*      -26*      -30*      -20**     -19**     -08
F Sca             +01       +02       +02       -08       -25*      +05
Hypoc             -17       -08       -06       -20**     -22**     -19**
Depress           -08       -17       -10       -25*      -31*      -05
Hyster            -18       -14       -09       -33*      -39*      -29*
Psych Dev         +04       +12       +05       -18**     -20*      +08
Path Sex          -15       -23*      -25*      -44*      -37*      -15
Schizo            +01       +12       +06       -25*      -35*      +08
KPR
Sociab            -09       -17       -21*      -10       -18**     [illeg.]
Domin             +16       -07       -03       +03       -11       +24*
Outdoor           +10       +4        +19*      +07       -02       -08
Mechan            +07       +05       +06       +09       +04       +21**
Comput            -04       -11       +06       +19*      +31*      -07
Liter             +26*      +37*      +41*      +10       +09       +18*
Music             -24*      -06       -07       -19**     -21**     -38*

Key: sp-com = Speed of comprehension. read-at = Reading attempts score.
*Significant at the .05 level.
**Significant at the .10 level.
Grade point averages were not recorded for all subjects.


Nonhonor students. Among those persons having a GPA lower 
than 3.5, personality variables that were found to be negatively 
related to reading ability were (See Table 3): 


MMPI
a. L Scale
b. F Scale
c. Hypochondriasis
d. Depression
e. Psychopathic Deviate
f. Pathological Sexuality
g. Obsessive-Compulsive
h. Hypomania (very high)

KPR
i. Practical
j. Computational
k. Social Service
l. Clerical


Further, linear correlation showed that the personality variables
which were positively correlated to reading ability among the high
achieving students were not the same as those among the low performing
students. However, the scales of L, Depression, Psychopathic
Deviate, and Pathological Sexuality correlated negatively
for both high-standing and low-standing students. By high negative
correlations on such scales as Obsessive-compulsive and the
F Scale, the low achieving student revealed a pattern of personality
which could be interpreted as introverted and somewhat neurotic.
His interests appeared ingrown as well.


TABLE 3
Linear Correlation between Variates in the Set of Personality Variables and the
Set of Reading Variables: Division of Data on the Basis of Grade
Point Average, Education Majors Having Averages
Below 3.6 (N = 271)

                                     r with
Person. Variab.   sp-com    SCAT      vocab     spell     gram      read-at
MMPI
L Sca             -08       -08       -06       -15*      -16*      -09
F Sca             -12       -12*      -06       -05       -07       -07
Hypoc             -04       -08       -07       -10*      -08       -01
Depress           -07       -09*      -01       -04       -04       +08
Psych Dev         -06       -05       -05       -19*      -19*      +02
Path Sex          -10*      -17*      -11**     -14*      -15*      -05
Obses-Comp        -02       -11*      -11*      -08       -09       [illeg.]
Hypoman           -18*      -14*      -17*      -10**     -14*      -12
Introver          +11*      +04       +15*      +09       +13*      +07
KPR
Pract             -05       -11*      -07       -04       -06       -19
Theor             +10**     +14*      +10*      +11*      +14*      08
Agree             +18*      +13*      +15*      +06       +11       +02
Comput            +02       -04       -11*      +06       +08       +05
Scient            +10**     +15*      +15       +07       +10*      [illeg.]
Persuas           -02       +05       -03       +03       04        +12*
Liter             +18*      +27*      +22*      +16*      +11*      +16*
Music             +12*      +05       +07       +02       +09       +00
Soc-ser           -09       [illeg.]  [illeg.]  [illeg.]  [illeg.]  -13*
Cler              -08       -13*      -13*      -04       15        -08

Key: sp-com = Speed of comprehension. read-at = Reading attempts score.
*Significant at the .05 level.
**Significant at the .10 level.

Positive correlations between reading ability and personality at- 
tributes for the two sub-groups of students were different but in 
the same direction. Highly correlated with reading ability for the 
low achieving student were qualities of an agreeable, ideational, 
and introverted nature. His interests revealed a cultural, theoretical 
orientation. For the better student, very few positive correlations
were indicated. 

Division of the sample on the basis of SCAT ability tests. When 
the sample was divided on the basis of SCAT ability deciles, there 
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were 125 persons in the top ten per cent of the group (the highly 
able students) and 143 in the lowest forty per cent (the less able 
students). These two groups represented approximately the top 
fourth and lowest fourth of the tested group. All students had 
SCAT decile scores recorded, but the middle portion of the popu- 
lation was not considered; the marked restriction of range may 
well have been a factor in reduced correlations, but afforded the 
possibility of greater contrast between the most able and the least 
able in the sample population. 

The highly able students. The data pertaining to the following 
findings have been deposited with A.D.I.;1 only essential informa- 
tion is reported. 

The personality variables among persons in the top ten per cent 
on the SCAT ability test found to be negatively correlated to 
reading ability were:


MMPI
a. L Scale
b. Hysteria
c. Pathological Sexuality
d. Hypomania

KPR
e. Computational
f. Clerical


Linear correlation showed that among these same students per- 
sonality variables that positively correlated to reading ability were: 


KPR
a. Theoretical
b. Agreeable
c. Literary
d. Musical


The less able students. Among those persons indicating less abil- 
ity (testing in the lowest forty per cent on this test) personality 
variables found to be negatively correlated to reading ability 
(which were reported in Table 5 deposited with ADI) were: 


MMPI
a. L Scale
b. Hypochondriasis
c. Psychopathic Deviate
d. Pathological Sexuality
e. Obsessive-compulsive
f. Schizophrenia



1 Tables 4 and 5 have been deposited with American Documentary Institute: 
Order Document #9677 remitting $125 for 35mm. microfilm or $125 for 
6" by 8” photocopies. Order from ADI Auxiliary Publication Project, Photo- 
duplication Service, Library of Congress, Washington, D.C. 20540. Make checks 
payable to Chief, Photoduplication Service, Library of Congress. 
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Linear correlation of the variables within the two major varia- 
bles showed that among those students having lower ability scores 
personality variables that positively correlated to reading ability 
were: 


KPR
a. Theoretical
b. Persuasive
c. Literary


A cursory examination of significant correlations between per- 
sonality variables and reading ability for the sample divided on the 
basis of SCAT scores revealed that there are comparatively few 
significant correlations. Therefore, it would appear that ability 
scores (closely related to intelligence), when used as a basis for 
separation of the sample, did not afford discriminating informa- 
tion. However, the personality variables which related negatively 
to reading ability indicated that the more able student had a differ- 
ent personality pattern from the less able student with regard to 
reading ability. 

Summary and conclusions. A division of the sample on the basis 
of grade point average suggested that the low-achieving student 
exhibited a personality pattern related to reading ability different 
from that of the honor student. The poor reader who achieved high
in school work reflected a kind of “acting-out-neuroses” personality
syndrome in terms of patterns of MMPI scores. However,
the poor reader who did not achieve honor standing revealed a 
personality pattern which was interpreted as extroverted and in- 
dicative of neurotic tendencies. High negative correlations on such 
scales as Obsessive-compulsive, Hypomania, and the F Scale sug- 
gested this pattern. His interests were more outgoing as well. 
Positive correlations revealed that the better-reading, nonhonor
student had high cultural and theoretical interests combined with
tendencies toward introversion.

A division of the sample on the basis of SCAT ability tests also 
indicated that the more intelligent student revealed a personality 
pattern different from the less intelligent student as it related to 
reading ability. Hysteria and Hypomania indicated that the “overt- 
neurosis” type of personality tended to be associated with lower 
performance among able students, whereas the less able college
student tended to have “conversional types” of neurosis as measured
by Hypochondriasis, Psychopathic Deviate, and Schizophrenia.
Thus the lower achiever appeared to be related in a much
more highly neurotic way to the reading task than was the student 
in any of the other categories. Considering the stress upon grades 
in our schools, this finding is hardly surprising. 


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THE PREDICTION OF TRAINEE SUCCESS IN A 
MANPOWER DEVELOPMENT AND 
TRAINING PROGRAM¹


DONALD SOMMERFELD AND FRANK A. FATZINGER
Western Michigan University 


A decade ago formal retraining programs were almost unknown.
Today they receive much attention from industry and from State and
Federal government. In 1960 and 1961, two of the first formal
retraining programs were undertaken (Auman, 1962). Successful
completion of retraining was a major problem in both of these
programs.

In 1962, the Federal Government passed the Manpower Development
and Training Act (MDTA). This act was focused on
training unemployed members of the labor force. The national
average of dropouts in MDTA programs is approximately twenty-five
per cent (U. S. Department HEW, 1965). Because the cost of
vocational training as conducted under MDTA is high, any trainee
who drops out without successfully mastering the occupation is a
costly loss.

A review of recent literature of prediction of trainee success or 
failure in vocational training showed a few small gains being made 
in reduction of dropouts through better selection. Compared to the 
wealth of information on prediction of academic success, a short- 
age of research in this area exists. 

Strength of measured interest on vocational interest tests was 
found to be predictive of subsequent success in Navy vocational 


¹This research project was undertaken by private citizens and should not
be construed as being an official report of either the United States Department
of Labor or the United States Department of Health, Education and
Welfare. The investigators wish to thank the Michigan Employment Security
Commission for their cooperation.
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training (Gordon and Alf, 1962). A combination of achievement 
and intelligence tests was found to be predictive of dropouts in 
trade school courses, but little predictive value was found in an 
investigation of biographical data (Patterson, 1956). The subject’s 
ability to follow instructions in a test situation was found to be 
predictive of success in military recruit training (Stern and Gordon,
1961). Intelligence, prior grade level, and arithmetic achievement
were found to be predictive of trainee success at the Michigan
Veterans Vocational School (Graybiel, 1959).

This study was undertaken using 320 male trainees referred to
ten Muskegon MDTA programs.* The objective of this study was
to identify items in each trainee’s record which could differentiate
between success and failure. These items could then be weighted
and used to predict potential success or failure of future applicants
for training. On the basis of this information, potentially unsuccessful
trainees could be eliminated from training at the time of
trainee selection.

Method. One group of trainees (N = 224) was used to determine
which variables would make the best predictors and what
optimum weight should be associated with each predictor. These
trainees had ended training before April 1965 under MDTA in
Muskegon, Michigan. The second group of trainees (N = 96)
was made up of Muskegon MDTA trainees that ended training
after April 1965. This second group of trainees was used to cross-validate
results obtained with the first group of trainees.

Because no single occupational class was large enough for this
study, trainees were picked from six related occupational classes
(see Table 1). Members of both groups were adult males, ages
19 to 50, from the Muskegon area. All members of both groups
were trained in skilled, blue-collar, manipulative occupations.

To set up a practical selection procedure, a pass-fail criterion
group dichotomy was used. Other criterion possibilities were explored
before final selection of this simple criterion. All graduates
were counted in the successful group.

Trainees who did not graduate were determined to have dropped
with good cause or without good cause (see Table 2). Being
dropped with good cause usually occurred when a trainee became
ill or when he quit to become employed. Being dropped without
good cause usually occurred when a trainee had poor attendance
or demonstrated poor progress or when a trainee terminated without
a stated reason. Dropouts without good cause were counted in
the failure group. The trainees that dropped with good cause
were eliminated from the study since these trainees could not be
called either good or bad selections.


*The first MDTA program in Muskegon, Michigan began in September
1962. By 1965, the programs included 354 graduates, 249 dropouts and 318
currently enrolled trainees. These programs have had a continuing dropout
rate of 28%.


TABLE 1
Occupational Classes Included in Sample Groups

Initial Group (N = 224)                 Cross-Validation Group (N = 96)
Class                        Number     Class                       Number
Turret Lathe (Setup)           35       Wood Machine Operator         25
Metal Machine Operator         23       Auto Mechanics                25
Wood Machine Operator          75       Truck Mechanics               21
Auto Mechanics                 20       Auto Body Repairman           25
Truck (Diesel) Mechanics       23
Auto Body Repairman            48
Total                         224       Total                         96

Eighteen test and nontest predictors were identified for each
trainee (see Table 2). The eight nontest predictors were biographical.

The ten test predictors were based on test scores obtained in the
General Aptitude Test Battery (GATB). The predictor minimum
test scores was recorded as a “yes” or “no,” meaning the trainee
did or did not meet the minimum test scores used for his occupational
training area on the GATB.

The other nine test-derived predictors were the sub-tests on the
General Aptitude Test Battery. These sub-tests are: Intelligence,
Verbal Aptitude, Numerical Aptitude, Spatial Aptitude, Form Perception,
Clerical Perception, Motor Coordination, Finger Dexterity,
and Manual Dexterity.
The data on the initial group of trainees (N = 224) were
analyzed with the help of the Western Michigan University Computer
Center. The first task in this analysis was to determine which,
if any, of the eighteen predictors significantly differentiated between
the two criterion groups.


The significant predictor items were combined into an optimum 
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TABLE 2 
Predictors and Criterion Classifications 


Non-Test Predictors
1. Weekly Training Allowance


7. Formal Education
8. Unemployment Compensation Recipient


Test Predictors
9. Minimum Test Norms (GATB)*
10. Intelligence (GATB)
11. Verbal Aptitude (GATB)
12. Numerical Aptitude (GATB)
13. Spatial Aptitude (GATB)
14. Form Perception (GATB)
15. Clerical Perception (GATB)
16. Motor Coordination (GATB)
17. Finger Dexterity (GATB)
18. Manual Dexterity (GATB)


Criterion Classifications


2. Dropouts with good cause N = 39
3. Dropouts without good cause N = 55


*General Aptitude Test Battery



prediction formula using a multiple regression equation. The same 
significant predictor items were then used with the cross-validation 
group (N = 96) in the same multiple regression equation.

Results. Each of the eighteen test and nontest predictor variables
(see Table 2) was compared with the good-bad criterion in order
to isolate usable predictors. Only five of these eighteen prediction
variables (see Table 3) were shown to be significant positive predictors
(.05 level) with the initial group (N = 224). As seen in
Table 3, age, general intelligence, and spatial aptitude were significant
at the .01 level when comparing the extreme good-bad
criterion groups of very good grads vs. drops without cause.

The five “best” prediction variables (based on the t-test used on 
initial group) were then used with the initial group to set up a 
multiple regression equation. These same five predictor variables 
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TABLE 3 
Level of Significance Obtained with “Best” Predictor Variables

                                          Criterion Groups
                                Very good grads vs.     All grads vs. drops
                                drops without good      without good cause
Predictor Variables             cause (N = 97)          (N = 185)

Age                                 3.597**                 3.223**
Unemployment Compensation
  Recipient                         1.918                   2.117*
General Intelligence (GATB)         3.159**                 2.251**
Verbal Aptitude (GATB)              2.484*                  1.862
Spatial Aptitude (GATB)             3.406**                 2.296*

and the obtained multiple regression equation were also used with 
the second or cross-validation group. 

The intercorrelation of the five prediction variables for the initial
group (see Table 4) showed high relationships among the
three GATB measures (intelligence, verbal, spatial). Correlation
with the success-fail criterion (see Table 4) showed age, intelligence,
and spatial aptitude to be the best predictor variables.
Several prediction variables that had face validity did not survive
the screening process. These variables, which were discarded
because they did not demonstrate a statistically significant relationship
to the training success measure, were: level of formal
education, amount of weekly allowances, number of dependents,
and ability to meet minimum GATB norms.


TABLE 4 
Intercorrelational Matrix for All Graduates vs. Drops without Good Cause

                 Age     U.C.    Intell.   Verbal   Spatial   Success-Fail

Age              1.00    -.00    .08       .18      -.08      .20
Unemployment
  Compensation           1.00    [illeg.]  .07      [illeg.]  .14
Intelligence                     1.00      .61      .74       .20
Verbal                                     1.00     .52       .16
Spatial                                             1.00      .20
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The multiple regression equation for the initial group was: 
$$X_1 = -.153 + .018X_2 + .124X_3 + .001X_4 - .001X_5 + .008X_6$$


$X_1$ is the new prediction score; $X_2$ is the age score; $X_3$ is the
Unemployment Compensation score; $X_4$ is the Intelligence score;
$X_5$ is the verbal score; and $X_6$ is the spatial score.
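
How the equation turns a trainee's record into a single prediction score can be sketched as follows. This is a minimal illustration, not the authors' program: the coefficients are those read from the printed equation, the dictionary keys and the example trainee are hypothetical, and the scoring of each predictor (for instance, how the Unemployment Compensation item was coded) is an assumption, since the text does not state it.

```python
# Coefficients as read from the printed equation; treat them as a
# reconstruction of a partly illegible line, not as verified values.
INTERCEPT = -0.153
WEIGHTS = {
    "age": 0.018,
    "unemployment_compensation": 0.124,
    "intelligence": 0.001,
    "verbal": -0.001,
    "spatial": 0.008,
}

def prediction_score(trainee):
    """Return the prediction score: intercept plus the weighted sum of the five predictors."""
    return INTERCEPT + sum(WEIGHTS[name] * trainee[name] for name in WEIGHTS)

# Hypothetical trainee record; the units and coding are assumptions.
example = {
    "age": 30,
    "unemployment_compensation": 1,
    "intelligence": 100,
    "verbal": 95,
    "spatial": 105,
}
print(round(prediction_score(example), 3))
```

Because the coding of the criterion and of the predictors is not given, the absolute size of a score from this sketch should not be interpreted; only the relative weighting of the predictors is illustrated.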

The resulting multiple correlation for the initial group was .325 
with a standard error of estimate of .439. An F-test of this multiple 
correlation revealed F = 4.22 which was significant at the .01 
level. 
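
The reported F can be checked against the multiple correlation. The sketch below assumes the standard test of a multiple R, F = (R²/k) / ((1 − R²)/(N − k − 1)), with k = 5 predictors and N = 185, the size of the all-graduates versus drops-without-good-cause comparison in Table 3. That this N entered the regression is an inference from the tables, not a statement in the text.

```python
def f_for_multiple_r(r, n, k):
    """F ratio for testing a multiple correlation r based on n cases and k predictors."""
    r2 = r * r
    return (r2 / k) / ((1.0 - r2) / (n - k - 1))

# Assumed inputs: R = .325, N = 185 cases, k = 5 predictors.
f_value = f_for_multiple_r(0.325, 185, 5)
print(round(f_value, 2))
```

Under these assumptions the ratio comes to about 4.23 on 5 and 179 degrees of freedom, which agrees closely with the reported F of 4.22.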

The same criterion and the same predictor equation were also 
used with the cross-validation group. The resulting multiple cor- 
relation for the cross-validation group was —.06. This correlation 
was not significant at the .05 level. 

None of the predictor variables found to be statistically signifi- 
cant with the initial group continued to be significant when used 
with the cross-validation group. 
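
The cross-validation step amounts to applying the fixed equation to the second group and correlating the resulting scores with the actual outcomes. A minimal sketch, assuming a plain Pearson correlation and a pass (1) / fail (0) coding; the data shown are invented for illustration and are not the study's records.

```python
import math

def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical prediction scores for a hold-out group and their outcomes.
predicted = [0.90, 0.70, 0.80, 0.60, 0.75, 0.65]
observed = [1, 1, 0, 1, 0, 1]
print(round(pearson_r(predicted, observed), 2))
```

A correlation near zero (or negative, as reported here) in the hold-out group indicates that the weights fitted to the initial group did not generalize.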

Discussion. A cross-validation group was used to check the use- 
fulness of the prediction equation developed with the initial group. 
If the equation had yielded a significant correlation, the moderate 
correlation coefficient obtained initially between the five predic- 
tion variables and actual success or failure in MDTA training 
would have been reaffirmed. Since the multiple correlation coeffi- 
cient obtained with the cross-validation group was not significant, 
the equation cannot be used with future MDTA programs. 

The intelligence, age, spatial aptitude, and verbal aptitude varia- 
bles showed no significant relationship to trainee success in the 
cross-validation. Since these variables were significant predictors 
with the initial group, these negative results were not anticipated 
with the cross-validation group. Future investigators should be 
cautioned to include a cross-validation procedure in all studies of 
this type. Naive acceptance of the results with the initial group 
would have led to an unwarranted use of the prediction equation.

The different unemployment percentages in Muskegon, Michi- 
gan, during the two time periods, created a possible important 
difference between the two sample populations used in this study. 
Unemployment averaged 4.8 per cent during the time period in
which the initial group members were being selected and trained
(April 1962-March 1965). Unemployment averaged only 3.9 per
cent while the cross-validation group members were being trained
(June 1964-September 1965). This is a drop of 20 per cent in the
average number of unemployed from the initial to the cross-vali- 
dation group. It is possible that trainees with high potential abil- 
ities and motivation to succeed were not able to find employment 
during the initial period, and so they enrolled in training classes 
and became successful trainees. During the high employment pe- 
riod, the man with high abilities and motivation may find em- 
ployment and, therefore, be excluded from the training programs. 
It is recommended that future investigations should control for the 
area unemployment rate. 
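
The size of the drop can be worked out directly from the two rates. The sketch below computes the relative decline in the unemployment rate; treating this as the decline in the number of unemployed assumes a labor force of roughly constant size, which the text does not state.

```python
initial_rate = 4.8            # per cent, initial group period
cross_validation_rate = 3.9   # per cent, cross-validation group period

# Relative decline of the rate, expressed in per cent of the initial rate.
relative_drop = (initial_rate - cross_validation_rate) / initial_rate * 100
print(f"{relative_drop:.2f} per cent")
```

The relative decline is just under 19 per cent, which the text reports in round terms as 20 per cent.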
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SEMANTIC MEMORY ABILITY TESTS AS 
PREDICTORS OF FRESHMAN COLLEGE 
ACHIEVEMENT 


WAYNE S. ZIMMERMAN
California State College, Los Angeles 
STEPHEN W. BROWN
California State College, Dominguez Hills 
AND 
WILLIAM B. MICHAEL 
University of Southern California 


Purpose. The purpose of this short note is to report coefficients
of validity for ten tests of semantic memory abilities which were 
developed within the framework of Guilford’s structure-of-intel- 
lect model (Brown, Guilford, and Hoepfner, 1966). The criteria 
employed were grade point averages (GPA) in each of ten de- 
partmental course offerings, as well as in all academic courses 
combined. It was thought that zero-order coefficients of correla- 
tion would afford a practical basis for evaluating the promise that 
these experimental tests might hold for academic prediction in a 
state college setting. 

Samples and procedures. The samples consisted of 165 male and 
203 female freshmen who were tested during orientation week 
prior to the 1964 fall semester registration at CSCLA. The corre- 
lations which are presented in Tables 1 and 2 were obtained 
through the WDCORR program on the IBM 7094 computer at the 
UCLA Western Data Processing Center. Information concerning 
what each of the tests cited in these tables was hypothesized to 
represent may be found in the detailed report by Brown, Guilford, 
and Hoepfner (1966). 

Findings. From the entries in Table 1 and Table 2 it is evident 
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that for the most part the validity coefficients are small and that 
relatively few are statistically significant. Indeed, the magnitudes 
of these coefficients with total grade point average offer little 
promise of practical validity for prediction purposes. Some high 
correlations obtained in specific departmental areas are of interest 
(note particularly correlations with GPA’s obtained for females in 
Music Department courses); but because of the small N's, it would
be risky to conclude that the tests are valid in these instances 
without confirmation from additional administrations to larger 
samples. 

Conclusions. Although lengthening of several of these experi- 
mental tests might be expected to result in higher reliabilities and 
in modest increases in their validities, it appears doubtful that any 
one test would afford any substantial promise for academic prediction
in terms of a criterion of grade point average. Moreover,
there is no compelling reason to believe that use of any one of 
these memory tests in combination with well-known standardized 
examinations for college selection would yield coefficients of multiple
correlation substantially higher than the zero-order correlations
associated with recognized standardized tests of scholastic
aptitude so long as the grade point average is the sole criterion
employed.
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TEMPERAMENT AND ATTITUDE CORRELATES 
OF LEADERSHIP BEHAVIOR 


A. M. FOX 
Sam Houston State College 


Problem. The purpose of this study was to determine the extent 
to which the Guilford-Zimmerman Temperament Survey (GZTS) 
and the Minnesota Teacher Attitude Inventory (MTAI) scores 
may be useful for the prediction of leadership behavior of ele- 
mentary school principals as described by the Leader Behavior 
Description Questionnaire (LBDQ). 

Sample. The sample consisted of 77 elementary school princi- 
pals of the Dallas Independent School District, Dallas, Texas. 

Procedure. GZTS and MTAI scores were obtained from each
of 77 elementary school principals. Each of 10 teaching staff mem- 
bers selected at random from the individual school faculty list was 
asked to complete the LBDQ. The mean of these LBDQ scores 
was taken as the measure of leader behavior for each principal. 

For each of the 10 factor scores of the GZTS and the total 
MTAI score Pearson's product moment correlation coefficient (r) 


TABLE 1 

Significant Relationships of Independent Variables with LBDQ 

LBDQ      GZTS                 Regression            Standard Error
Factor    Factor      r        Equation              of Estimate

IS        G          0.39      43.74 + .44 (G)          1.87
          A          0.29      44.38 + .30 (A)          1.95
C         A         -0.55      54.29 - 1.50 (A)         4.52
          R          0.38      39.84 + 1.25 (R)         5.02
          S         -0.25      52.72 - 1.04 (S)         5.28
          M         -0.29      50.82 - 0.73 (M)         5.21
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was calculated to establish its relationship with the two factor 
scores of the LBDQ. 

For those correlation coefficients found to be significant at the
.05 level, the regression equation and the standard error of estimate
were established.

Results. The results of this study are summarized in Table 1. 
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PREDICTION OF PERFORMANCE IN 
STUDENT TEACHING 


JACK WILLIAMS AND A. M. FOX
Sam Houston State College 


Problem. The purpose of this study was to determine the extent 
to which the Minnesota Teacher Attitude Inventory (MTAI), the
Thurstone Temperament Schedule (TTS), the Otis Mental Abil- 
ity Test (IQ), the American College Test (ACT), and the grade- 
point ratio (GPR) in area of teaching specialization may be useful 
for the prediction of performance in student teaching (EVAL) 
as measured by the evaluation sheet (ES) in use at Sam Houston 
State College. 

Sample. The sample consisted of 205 students at Sam Houston 
State College who were enrolled in their professional semester 
which included nine weeks of course work offered on campus 
followed by a nine weeks student teaching experience in one of 
eleven public school districts. The sample included 117 secondary 
and 88 elementary student teachers. 

Procedure. During the nine-week courses preceding the actual 
student teaching experience the MTAI and TTS were administered to
each participant. Official college records were examined to obtain 
percentile scores on ACT, IQ measures, and GPR in area of 
specialization. 

During the fifth week of student teaching an ES was completed 
by both the cooperating teacher and the college supervisor. At the 
close of the student teaching period a joint ES was completed 
during a conference between cooperating teacher and college su- 
pervisor. The arithmetic mean of these three ESs was used as the 
measure of performance of the student teacher (EVAL). 

These data were processed in an IBM 1620 Data Processing 
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System using the Stepwise Regression Analysis Program. An F- 
ratio of 4.0 was used for the criterion in testing the significance of 
the relationships. 

Results. A significant relationship was found between GPR and
EVAL. Mean GPR for the 205 student teachers was 2.83 (4-point
system) with an S.D. of 0.50. Mean EVAL was 49.66 with an
S.D. of 7.73. The regression equation was: EVAL = 37.48 + 4.29
(GPR) with the standard error of estimate 7.43. The correlation
coefficient between GPR and EVAL was 0.28.
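
The reported regression constants can be checked against the reported means, standard deviations, and correlation. The sketch below reads the printed constants as 37.48 and 4.29 and assumes the textbook relations for simple linear regression (slope = r·SD_y/SD_x, intercept = mean_y − slope·mean_x, standard error of estimate ≈ SD_y·√(1 − r²)); the variable and function names are illustrative.

```python
import math

mean_gpr, sd_gpr = 2.83, 0.50      # predictor: grade-point ratio
mean_eval, sd_eval = 49.66, 7.73   # criterion: student teaching evaluation
r = 0.28

# Values implied by the summary statistics alone.
slope = r * sd_eval / sd_gpr
intercept = mean_eval - slope * mean_gpr
std_error_estimate = sd_eval * math.sqrt(1.0 - r * r)

def predict_eval(gpr, a=37.48, b=4.29):
    """Predicted EVAL from GPR using the constants read from the text."""
    return a + b * gpr

print(round(slope, 2), round(intercept, 2), round(std_error_estimate, 2))
print(round(predict_eval(3.5), 2))
```

The implied slope, intercept, and standard error of estimate fall close to the reported 4.29, 37.48, and 7.43; the small differences are consistent with rounding of r and the standard deviations.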
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NOTE ON THE KIT OF REFERENCE TESTS! 


H. W. STOKER AND R. P. KROPP
Florida State University 


This note deals with the mean performance, variability, and
reliabilities of 37 tests of the Kit of Reference Tests for Cognitive 
Factors (French, Ekstrom, and Price, 1963) which were admin- 
istered to approximately 65 to 75 students from each of grades 
9, 10, and 11, and then re-administered to them ten months later. 

The 37 tests and the factors they purportedly measure are identi- 
fied in Table 1. When Part 1 and Part 2 of a selected test were 
administered, data were treated separately for each part and for the 
total of them. 

Four sets of tables are available; one set for each test-retest 
grade group (9th to 10th, 10th to 11th, and 11th to 12th) and one 
set for all grades combined. Each set of tables contains the mean 
and standard deviation for the first administration (April, 1965), 
the means and standard deviations based on performance 10 
months later (February, 1966), and test-retest correlations based 
on a 10 month time lapse. For each grade, the number of subjects 
involved in the calculation of values ranged from 60-75. For the 
total group statistics, the numbers ranged from 200-215. 

Copies of the tables are available in mimeograph form and may 
be obtained from the authors.


1The work reported herein was supported by the U. S. Office of Education 
contract number OE-4-10-019, project 2117. 
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TABLE 1 


Kit of Reference Tests Administered and Factor Identification


Test Name                               Factor

Hidden Figures Test                     Flexibility of Closure
Hidden Patterns Test                    Flexibility of Closure
Copying Test                            Flexibility of Closure
Gestalt Completion Test                 Speed of Closure
Concealed Words Test                    Speed of Closure
Controlled Associations Test            Associational Fluency
Associations IV                         Associational Fluency
Simile Interpretations                  Expressional Fluency
Word Arrangements                       Expressional Fluency
Topics Test                             Ideational Fluency
Thing Categories Test                   Ideational Fluency
Letter Sets Test                        Induction
Locations Test                          Induction
Figure Classification                   Induction
Estimation of Length Test               Length Estimation
Shortest Road Test                      Length Estimation
Nearer Point Test                       Length Estimation
Picture-Number Test                     Associative Memory
First and Last Names Test               Associative Memory
Addition Test                           Number Facility
Division Test                           Number Facility
Subtraction and Multiplication Test     Number Facility
Plot Titles*                            Originality
Symbol Production                       Originality
Ship Destination Test                   General Reasoning
Necessary Arithmetic Operations         General Reasoning
Gestalt Transformation                  Semantic Redefinition
Object Synthesis                        Semantic Redefinition
Nonsense Syllogisms Test                Syllogistic Reasoning
Logical Reasoning Test                  Syllogistic Reasoning
Inference Test                          Syllogistic Reasoning
Apparatus Test                          Sensitivity to Problems
Seeing Problems                         Sensitivity to Problems
Vocabulary                              Verbal Comprehension
Wide Range Vocabulary Test              Verbal Comprehension
Match Problems II                       Figural Adaptive Flexibility
Planning Air Maneuvers                  Figural Adaptive Flexibility

*Scores on this test were judged to be invalid for ninth-grade students.
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